0.1 Libraries Required

1 1. Introduction

This project investigates the relationship between county-level broadband connectivity and social vulnerability across the United States.

Data Year Configuration: This analysis can use either December 2020 or May 2025 FCC broadband data. Set USE_FCC_2025 <- TRUE (default) or FALSE in the setup chunk above.

We use:

This project follows a fully reproducible workflow using R Markdown, where all data cleaning, merging, analysis, and visualizations are generated inside this document. # 1.1 Data Inventory Table

  1. Social Vulnerability Index (SVI)

Dataset Name : CDC/ATSDR Social Vulnerability Index (SVI) Source / URL : https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html Year Used : 2020 Geographic Level: County (FIPS) Format: CSV Files Used: SVI2020_US_COUNTY.csv Size: ~5–6 MB Key Variables: County FIPS, overall SVI percentile, SVI Theme scores (Socioeconomic, Household Composition, Minority Status, Housing/Transportation) Data Quality Notes: Percentile values range 0–1; no SVI data for Puerto Rico; a few suppressed values for remote Alaska counties.

  1. FCC Broadband Deployment Data (December 2020) Dataset Name : FCC Form 477 Fixed Broadband Deployment Summary Source / URL : FCC Broadband Deployment Data Year Used: Dec 2020 Geographic Level: County Format: CSV Files Used: fcc_dec_2020_county.csv (cleaned version) Size: ~20–30 MB Key Variables: FIPS, availability of ≥25/3 Mbps broadband, provider counts, technology codes Data Quality Notes: Known issues include over-reporting availability; no coverage for Puerto Rico or territories.

  2. Microsoft Airband Broadband Usage

Dataset Name: Microsoft Airband “US Broadband Usage” Source / URL: https://github.com/microsoft/USBroadbandUsage Year Used: 2020 Geographic Level : County Format : CSV Files Used : airband_2020_county.csv Size: ~2 MB Key Variables : fips, measured broadband usage, observed availability Data Quality Notes : Usage values are decimals (0–1). No PR or territories.

  1. ACS Data — Computer & Internet Use (Table B28002)

Dataset Name : ACS 5-Year Estimates — B28002 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B28002-Data.csv Size : ~4 MB Key Variables : Total households, households with broadband, households with no internet access Data Quality Notes : Minor MOE columns removed; numeric conversion required.

  1. ACS Data — Median Household Income (Table B19013)

Dataset Name : ACS 5-Year Estimates — B19013 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B19013-Data.csv Size : ~3 MB Key Variables : Median household income Data Quality Notes : No PR values; income suppressed for some remote Alaska regions.

  1. ACS Data — Education Attainment (Table B15003)

Dataset Name : ACS 5-Year Estimates — B15003 Source / URL : data.census.gov Year Used : 2020 Geographic Level : County Format : CSV Files Used : ACSDT5Y2020.B15003-Data.csv Size : ~4 MB Key Variables : Education levels (HS, bachelor’s, master’s, doctorate) Data Quality Notes :Consistent coverage except PR and some small Alaska areas.

  1. Geographic Shapefiles (TIGER/Line 2020)

Dataset Name : TIGER/Line 2020 County Shapefile Source / URL : https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html Year Used : 2020 Geographic Level : County (MULTIPOLYGON) Format : ESRI Shapefile (.shp, .dbf, .shx, .prj) Files Used : tl_2020_us_county.shp + associated files Size : ~50–100 MB Key Variables : GEOID, county name, state FIPS, geometry Data Quality Notes : Includes 3,234 county-equivalents → includes Puerto Rico, territories, and Alaska census areas. Must be filtered before analysis.

2 2. FCC Broadband Data (2016–2024, focusing on 2020)

In this section, we work with the FCC Form 477 county-level broadband tier dataset.
The raw file contains semiannual observations (June and December) for each county from 2016–2024.

FCC Form 477 includes both deployment (availability) and subscription (adoption) datasets. The deployment dataset — used in this project — measures broadband availability across speed tiers (Tier 1–4) and provides a county-level snapshot of the infrastructure that is available to residents. This is distinct from the subscription dataset, which reflects how many households actually subscribe to broadband. Because our project examines disparities in access, we chosse the deployment dataset.

2.1 2.1 Load and Inspect FCC Raw Data

# Load the raw FCC county tier data file
fcc_raw <- read_csv(
  path_raw("fcc", "form477_fcc_data.csv")
)

# View column names and structure
names(fcc_raw)
##  [1] "Year"          "Month"         "FIPS"          "State"        
##  [5] "County"        "State_Name"    "County_Name"   "Housing_Units"
##  [9] "Tier_1"        "Tier_2"        "Tier_3"        "Tier_4"
glimpse(fcc_raw)
## Rows: 67,926
## Columns: 12
## $ Year          <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ Month         <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ FIPS          <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, 10…
## $ State         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ County        <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
## $ State_Name    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
## $ County_Name   <chr> "Autauga County", "Baldwin County", "Barbour County", "B…
## $ Housing_Units <dbl> 25318, 135669, 11816, 9208, 25065, 4616, 9911, 53540, 16…
## $ Tier_1        <dbl> 5, 4, 3, 3, 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 4,…
## $ Tier_2        <dbl> 5, 4, 3, 3, 3, 3, 4, 4, 4, 3, 4, 2, 3, 3, 3, 4, 4, 2, 4,…
## $ Tier_3        <dbl> 5, 4, 3, 2, 3, 3, 3, 4, 4, 2, 3, 2, 2, 2, 2, 4, 4, 2, 4,…
## $ Tier_4        <dbl> 4, 3, 3, 2, 2, 3, 2, 3, 4, 1, 2, 1, 1, 2, 2, 4, 3, 1, 3,…

2.2 2.2 Clean FCC Data: Keep Only December 2020

fcc_2020_dec <- fcc_raw %>%
  rename_with(tolower) %>%
  filter(year == 2020, month == 12) %>%
  mutate(
    county_fips = str_pad(as.character(fips), 5, side = "left", pad = "0")
  ) %>%
  select(
    county_fips,
    state_name,
    county_name,
    housing_units,
    tier1 = tier_1,
    tier2 = tier_2,
    tier3 = tier_3
  )
head(fcc_2020_dec)
## # A tibble: 6 × 7
##   county_fips state_name county_name    housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <dbl> <dbl> <dbl>
## 1 01001       Alabama    Autauga County         24165     4     4     4
## 2 01003       Alabama    Baldwin County        122518     4     3     3
## 3 01005       Alabama    Barbour County         12120     3     3     2
## 4 01007       Alabama    Bibb County             9340     3     2     1
## 5 01009       Alabama    Blount County          24625     3     3     2
## 6 01011       Alabama    Bullock County          4625     3     3     1
summary(fcc_2020_dec)
##  county_fips         state_name        county_name        housing_units    
##  Length:3234        Length:3234        Length:3234        Min.   :      0  
##  Class :character   Class :character   Class :character   1st Qu.:   5620  
##  Mode  :character   Mode  :character   Mode  :character   Median :  12697  
##                                                           Mean   :  44079  
##                                                           3rd Qu.:  31556  
##                                                           Max.   :3599420  
##      tier1           tier2           tier3      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000   Median :3.000   Median :3.000  
##  Mean   :3.681   Mean   :3.272   Mean   :2.717  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000

2.3 2.3 Broadband validation and consistency checks

# If not already in memory, load the non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")

broadband_check <- master_2020 %>%
  select(
    FIPS,
    housing_units,
    tier1, tier2, tier3,
    airband_fcc_availability,
    airband_usage
  )

# 1) Range checks for FCC/Airband measures
summary(broadband_check)
##      FIPS           housing_units         tier1       tier2      
##  Length:3141        Min.   :    221   Min.   :4   Min.   :1.000  
##  Class :character   1st Qu.:   6856   1st Qu.:4   1st Qu.:2.000  
##  Mode  :character   Median :  15173   Median :4   Median :4.000  
##                     Mean   :  50430   Mean   :4   Mean   :3.104  
##                     3rd Qu.:  36778   3rd Qu.:4   3rd Qu.:4.000  
##                     Max.   :3868643   Max.   :4   Max.   :4.000  
##                                                                  
##      tier3       airband_fcc_availability airband_usage   
##  Min.   :1.000   Min.   :0.0002           Min.   :0.0010  
##  1st Qu.:2.000   1st Qu.:0.7698           1st Qu.:0.1960  
##  Median :2.000   Median :0.9244           Median :0.3680  
##  Mean   :2.499   Mean   :0.8404           Mean   :0.3896  
##  3rd Qu.:3.000   3rd Qu.:0.9839           3rd Qu.:0.5660  
##  Max.   :4.000   Max.   :1.0000           Max.   :1.0000  
##                  NA's   :9                NA's   :11
# Check that availability and usage are between 0 and 1
invalid_range <- broadband_check %>%
  filter(
    airband_fcc_availability < 0 | airband_fcc_availability > 1 |
    airband_usage            < 0 | airband_usage            > 1
  )

invalid_range   # ideally 0 rows
## # A tibble: 0 × 7
## # ℹ 7 variables: FIPS <chr>, housing_units <dbl>, tier1 <int>, tier2 <int>,
## #   tier3 <int>, airband_fcc_availability <dbl>, airband_usage <dbl>
# Check that tiers (if present) are within 1–4
invalid_tiers <- broadband_check %>%
  filter(
    (!is.na(tier1) & (tier1 < 1 | tier1 > 4)) |
    (!is.na(tier2) & (tier2 < 1 | tier2 > 4)) |
    (!is.na(tier3) & (tier3 < 1 | tier3 > 4))
  )

invalid_tiers   # ideally 0 rows
## # A tibble: 0 × 7
## # ℹ 7 variables: FIPS <chr>, housing_units <dbl>, tier1 <int>, tier2 <int>,
## #   tier3 <int>, airband_fcc_availability <dbl>, airband_usage <dbl>
# 2) Missingness in key broadband fields
colSums(is.na(broadband_check))
##                     FIPS            housing_units                    tier1 
##                        0                        0                        0 
##                    tier2                    tier3 airband_fcc_availability 
##                        0                        0                        9 
##            airband_usage 
##                       11
# 3) Simple consistency check: usage should not greatly exceed availability
usage_gt_avail <- broadband_check %>%
  filter(
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    airband_usage > airband_fcc_availability
  )

usage_gt_avail %>% head()
## # A tibble: 6 × 7
##   FIPS  housing_units tier1 tier2 tier3 airband_fcc_availability airband_usage
##   <chr>         <dbl> <int> <int> <int>                    <dbl>         <dbl>
## 1 01063          5979     4     1     1                   0.0085         0.013
## 2 01105          5751     4     1     1                   0.0012         0.032
## 3 02185          3534     4     4     1                   0.0061         0.051
## 4 02188          3329     4     4     1                   0.0081         0.039
## 5 02198          4267     4     4     1                   0.0282         0.04 
## 6 02290          5467     4     4     1                   0.013          0.139
nrow(usage_gt_avail)
## [1] 50

FCC Data Validation Summary

After cleaning the FCC December 2020 county-level dataset, we verified the quality of all key variables before merging them into the master file. All FCC tiers (tier1, tier2, tier3) fall within the correct range of 1–4 with no invalid values detected. Housing-unit counts are strictly positive for all counties, and no negative, zero, or corrupted values were found. The FCC dataset therefore passes internal consistency checks and is suitable for downstream analysis.

2.4 2.4 Save Cleaned FCC Dataset

write_csv(fcc_2020_dec, path_processed("fcc_2020_dec_clean.csv"))
saveRDS(fcc_2020_dec, path_processed("fcc_2020_dec_clean.rds"))

3 3. Social Vulnerability Index (SVI) Data – 2020 (County Level)

The CDC/ATSDR Social Vulnerability Index (SVI) provides an overall vulnerability percentile and four theme-level indices for each U.S. county. Here we load the 2020 county-level SVI dataset.

3.1 3.1 Load and Inspect SVI Raw Data

svi_raw <- read_csv(
  path_raw("svi", "SVI_2020_US_county.csv")
)

names(svi_raw)
##   [1] "ST"           "STATE"        "ST_ABBR"      "STCNTY"       "COUNTY"      
##   [6] "FIPS"         "LOCATION"     "AREA_SQMI"    "E_TOTPOP"     "M_TOTPOP"    
##  [11] "E_HU"         "M_HU"         "E_HH"         "M_HH"         "E_POV150"    
##  [16] "M_POV150"     "E_UNEMP"      "M_UNEMP"      "E_HBURD"      "M_HBURD"     
##  [21] "E_NOHSDP"     "M_NOHSDP"     "E_UNINSUR"    "M_UNINSUR"    "E_AGE65"     
##  [26] "M_AGE65"      "E_AGE17"      "M_AGE17"      "E_DISABL"     "M_DISABL"    
##  [31] "E_SNGPNT"     "M_SNGPNT"     "E_LIMENG"     "M_LIMENG"     "E_MINRTY"    
##  [36] "M_MINRTY"     "E_MUNIT"      "M_MUNIT"      "E_MOBILE"     "M_MOBILE"    
##  [41] "E_CROWD"      "M_CROWD"      "E_NOVEH"      "M_NOVEH"      "E_GROUPQ"    
##  [46] "M_GROUPQ"     "EP_POV150"    "MP_POV150"    "EP_UNEMP"     "MP_UNEMP"    
##  [51] "EP_HBURD"     "MP_HBURD"     "EP_NOHSDP"    "MP_NOHSDP"    "EP_UNINSUR"  
##  [56] "MP_UNINSUR"   "EP_AGE65"     "MP_AGE65"     "EP_AGE17"     "MP_AGE17"    
##  [61] "EP_DISABL"    "MP_DISABL"    "EP_SNGPNT"    "MP_SNGPNT"    "EP_LIMENG"   
##  [66] "MP_LIMENG"    "EP_MINRTY"    "MP_MINRTY"    "EP_MUNIT"     "MP_MUNIT"    
##  [71] "EP_MOBILE"    "MP_MOBILE"    "EP_CROWD"     "MP_CROWD"     "EP_NOVEH"    
##  [76] "MP_NOVEH"     "EP_GROUPQ"    "MP_GROUPQ"    "EPL_POV150"   "EPL_UNEMP"   
##  [81] "EPL_HBURD"    "EPL_NOHSDP"   "EPL_UNINSUR"  "SPL_THEME1"   "RPL_THEME1"  
##  [86] "EPL_AGE65"    "EPL_AGE17"    "EPL_DISABL"   "EPL_SNGPNT"   "EPL_LIMENG"  
##  [91] "SPL_THEME2"   "RPL_THEME2"   "EPL_MINRTY"   "SPL_THEME3"   "RPL_THEME3"  
##  [96] "EPL_MUNIT"    "EPL_MOBILE"   "EPL_CROWD"    "EPL_NOVEH"    "EPL_GROUPQ"  
## [101] "SPL_THEME4"   "RPL_THEME4"   "SPL_THEMES"   "RPL_THEMES"   "F_POV150"    
## [106] "F_UNEMP"      "F_HBURD"      "F_NOHSDP"     "F_UNINSUR"    "F_THEME1"    
## [111] "F_AGE65"      "F_AGE17"      "F_DISABL"     "F_SNGPNT"     "F_LIMENG"    
## [116] "F_THEME2"     "F_MINRTY"     "F_THEME3"     "F_MUNIT"      "F_MOBILE"    
## [121] "F_CROWD"      "F_NOVEH"      "F_GROUPQ"     "F_THEME4"     "F_TOTAL"     
## [126] "E_DAYPOP"     "E_NOINT"      "M_NOINT"      "E_AFAM"       "M_AFAM"      
## [131] "E_HISP"       "M_HISP"       "E_ASIAN"      "M_ASIAN"      "E_AIAN"      
## [136] "M_AIAN"       "E_NHPI"       "M_NHPI"       "E_TWOMORE"    "M_TWOMORE"   
## [141] "E_OTHERRACE"  "M_OTHERRACE"  "EP_NOINT"     "MP_NOINT"     "EP_AFAM"     
## [146] "MP_AFAM"      "EP_HISP"      "MP_HISP"      "EP_ASIAN"     "MP_ASIAN"    
## [151] "EP_AIAN"      "MP_AIAN"      "EP_NHPI"      "MP_NHPI"      "EP_TWOMORE"  
## [156] "MP_TWOMORE"   "EP_OTHERRACE" "MP_OTHERRACE"
glimpse(svi_raw)
## Rows: 3,143
## Columns: 158
## $ ST           <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ STATE        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "A…
## $ ST_ABBR      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ STCNTY       <chr> "01001", "01003", "01005", "01007", "01009", "01011", "01…
## $ COUNTY       <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bullo…
## $ FIPS         <chr> "01001", "01003", "01005", "01007", "01009", "01011", "01…
## $ LOCATION     <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "Ba…
## $ AREA_SQMI    <dbl> 594.4558, 1589.8353, 885.0076, 622.4693, 644.8904, 622.81…
## $ E_TOTPOP     <dbl> 55639, 218289, 25026, 22374, 57755, 10173, 19726, 114324,…
## $ M_TOTPOP     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ E_HU         <dbl> 23697, 116747, 12057, 9237, 24404, 4591, 10119, 53722, 16…
## $ M_HU         <dbl> 68, 180, 119, 82, 87, 60, 57, 190, 70, 68, 60, 42, 63, 33…
## $ E_HH         <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572, 13582…
## $ M_HH         <dbl> 366, 1143, 338, 299, 430, 195, 282, 642, 418, 476, 358, 2…
## $ E_POV150     <dbl> 12611, 36413, 8965, 5730, 13624, 4128, 6133, 30323, 9277,…
## $ M_POV150     <dbl> 1349, 2802, 673, 907, 1151, 593, 794, 2125, 926, 842, 121…
## $ E_UNEMP      <dbl> 736, 4027, 649, 667, 1253, 139, 551, 3769, 548, 444, 1153…
## $ M_UNEMP      <dbl> 185, 714, 182, 260, 301, 107, 194, 560, 195, 170, 324, 15…
## $ E_HBURD      <dbl> 5029, 19350, 2305, 1580, 4060, 695, 1383, 9451, 2932, 196…
## $ M_HBURD      <dbl> 576, 1526, 298, 319, 431, 180, 218, 766, 368, 347, 482, 1…
## $ E_NOHSDP     <dbl> 4273, 14823, 4497, 3056, 6838, 1700, 1886, 11787, 4140, 3…
## $ M_NOHSDP     <dbl> 562, 1344, 387, 414, 499, 285, 286, 887, 488, 499, 686, 2…
## $ E_UNINSUR    <dbl> 4345, 20501, 2362, 1878, 5746, 1025, 1973, 11260, 3045, 2…
## $ M_UNINSUR    <dbl> 725, 1932, 311, 387, 695, 290, 380, 970, 454, 429, 804, 3…
## $ E_AGE65      <dbl> 8490, 44716, 4777, 3676, 10382, 1638, 4092, 20245, 6530, …
## $ M_AGE65      <dbl> 85, 109, 16, 125, 106, 19, 98, 127, 45, 67, 69, 0, 11, 58…
## $ E_AGE17      <dbl> 13143, 46993, 5222, 4584, 13372, 2279, 4401, 24769, 6946,…
## $ M_AGE17      <dbl> 49, 0, 29, 0, 85, 147, 146, 0, 68, 73, 27, 0, 27, 0, 35, …
## $ E_DISABL     <dbl> 9658, 30615, 4159, 3748, 8564, 1203, 3067, 22943, 6113, 4…
## $ M_DISABL     <dbl> 856, 1565, 287, 551, 775, 291, 317, 1157, 518, 522, 736, …
## $ E_SNGPNT     <dbl> 1608, 3317, 1029, 790, 1313, 499, 500, 2972, 895, 385, 10…
## $ M_SNGPNT     <dbl> 302, 547, 217, 252, 262, 200, 152, 387, 242, 130, 309, 12…
## $ E_LIMENG     <dbl> 363, 1593, 433, 75, 801, 150, 59, 754, 111, 275, 719, 25,…
## $ M_LIMENG     <dbl> 229, 486, 205, 107, 263, 114, 88, 226, 110, 181, 328, 81,…
## $ E_MINRTY     <dbl> 14479, 37334, 13694, 5724, 7690, 8011, 9601, 32275, 15014…
## $ M_MINRTY     <dbl> 566, 1363, 266, 158, 343, 199, 140, 645, 216, 251, 620, 1…
## $ E_MUNIT      <dbl> 918, 19513, 170, 228, 167, 73, 114, 1977, 1432, 250, 184,…
## $ M_MUNIT      <dbl> 292, 979, 77, 112, 95, 79, 49, 299, 308, 105, 123, 31, 96…
## $ E_MOBILE     <dbl> 4313, 11893, 3644, 2943, 6043, 1919, 2349, 8364, 2281, 42…
## $ M_MOBILE     <dbl> 440, 868, 347, 327, 515, 271, 272, 657, 335, 390, 551, 31…
## $ E_CROWD      <dbl> 339, 1280, 350, 91, 385, 33, 72, 547, 974, 151, 455, 52, …
## $ M_CROWD      <dbl> 162, 364, 192, 58, 139, 34, 44, 147, 332, 77, 174, 46, 83…
## $ E_NOVEH      <dbl> 1167, 2627, 1039, 481, 1077, 418, 359, 2270, 1000, 343, 6…
## $ M_NOVEH      <dbl> 320, 397, 202, 226, 275, 192, 97, 317, 243, 110, 212, 126…
## $ E_GROUPQ     <dbl> 578, 2954, 2910, 1657, 564, 433, 320, 2979, 558, 274, 429…
## $ M_GROUPQ     <dbl> 157, 403, 264, 221, 144, 163, 139, 461, 163, 145, 134, 61…
## $ EP_POV150    <dbl> 22.9, 16.9, 40.6, 27.6, 23.8, 42.5, 31.6, 27.3, 28.2, 25.…
## $ MP_POV150    <dbl> 2.4, 1.3, 3.0, 4.4, 2.0, 6.0, 4.1, 1.9, 2.8, 3.3, 2.8, 3.…
## $ EP_UNEMP     <dbl> 2.9, 3.9, 6.9, 7.4, 5.2, 3.4, 6.5, 7.3, 3.6, 4.1, 6.1, 6.…
## $ MP_UNEMP     <dbl> 0.7, 0.7, 1.9, 2.8, 1.2, 2.5, 2.3, 1.1, 1.2, 1.6, 1.7, 3.…
## $ EP_HBURD     <dbl> 23.3, 23.0, 24.7, 21.8, 19.1, 20.3, 20.8, 21.2, 21.6, 18.…
## $ MP_HBURD     <dbl> 2.6, 1.8, 3.1, 4.3, 2.0, 5.1, 3.2, 1.7, 2.6, 3.1, 2.8, 3.…
## $ EP_NOHSDP    <dbl> 11.3, 9.5, 25.3, 19.1, 17.2, 25.1, 13.6, 14.9, 17.4, 17.2…
## $ MP_NOHSDP    <dbl> 1.5, 0.9, 2.2, 2.6, 1.3, 4.1, 2.1, 1.1, 2.0, 2.6, 2.3, 2.…
## $ EP_UNINSUR   <dbl> 8.0, 9.5, 10.7, 9.1, 10.0, 10.5, 10.1, 10.0, 9.2, 9.5, 11…
## $ MP_UNINSUR   <dbl> 1.3, 0.9, 1.4, 1.9, 1.2, 3.0, 1.9, 0.9, 1.4, 1.7, 1.8, 2.…
## $ EP_AGE65     <dbl> 15.3, 20.5, 19.1, 16.4, 18.0, 16.1, 20.7, 17.7, 19.5, 22.…
## $ MP_AGE65     <dbl> 0.2, 0.1, 0.1, 0.6, 0.2, 0.3, 0.5, 0.1, 0.1, 0.3, 0.2, 0.…
## $ EP_AGE17     <dbl> 23.6, 21.5, 20.9, 20.5, 23.2, 22.4, 22.3, 21.7, 20.8, 19.…
## $ MP_AGE17     <dbl> 0.1, 0.0, 0.1, 0.0, 0.1, 1.4, 0.7, 0.0, 0.2, 0.3, 0.1, 0.…
## $ EP_DISABL    <dbl> 17.7, 14.2, 18.8, 18.1, 15.0, 12.3, 15.8, 20.4, 18.5, 16.…
## $ MP_DISABL    <dbl> 1.6, 0.7, 1.4, 2.6, 1.4, 3.0, 1.6, 1.0, 1.6, 2.0, 1.7, 2.…
## $ EP_SNGPNT    <dbl> 7.5, 3.9, 11.0, 10.9, 6.2, 14.6, 7.5, 6.7, 6.6, 3.6, 6.4,…
## $ MP_SNGPNT    <dbl> 1.4, 0.6, 2.3, 3.4, 1.2, 5.8, 2.3, 0.9, 1.8, 1.2, 1.8, 2.…
## $ EP_LIMENG    <dbl> 0.7, 0.8, 1.8, 0.4, 1.5, 1.6, 0.3, 0.7, 0.4, 1.1, 1.7, 0.…
## $ MP_LIMENG    <dbl> 0.4, 0.2, 0.9, 0.5, 0.5, 1.2, 0.5, 0.2, 0.3, 0.7, 0.8, 0.…
## $ EP_MINRTY    <dbl> 26.0, 17.1, 54.7, 25.6, 13.3, 78.7, 48.7, 28.2, 44.9, 8.7…
## $ MP_MINRTY    <dbl> 1.0, 0.6, 1.1, 0.7, 0.6, 2.0, 0.7, 0.6, 0.6, 1.0, 1.4, 0.…
## $ EP_MUNIT     <dbl> 3.9, 16.7, 1.4, 2.5, 0.7, 1.6, 1.1, 3.7, 8.4, 1.5, 0.9, 0…
## $ MP_MUNIT     <dbl> 1.2, 0.8, 0.6, 1.2, 0.4, 1.7, 0.5, 0.6, 1.8, 0.6, 0.6, 0.…
## $ EP_MOBILE    <dbl> 18.2, 10.2, 30.2, 31.9, 24.8, 41.8, 23.2, 15.6, 13.4, 25.…
## $ MP_MOBILE    <dbl> 1.8, 0.7, 2.8, 3.5, 2.1, 5.9, 2.7, 1.2, 2.0, 2.3, 2.8, 4.…
## $ EP_CROWD     <dbl> 1.6, 1.5, 3.8, 1.3, 1.8, 1.0, 1.1, 1.2, 7.2, 1.4, 2.7, 1.…
## $ MP_CROWD     <dbl> 0.8, 0.4, 2.1, 0.8, 0.7, 1.0, 0.7, 0.3, 2.4, 0.7, 1.0, 0.…
## $ EP_NOVEH     <dbl> 5.4, 3.1, 11.1, 6.6, 5.1, 12.2, 5.4, 5.1, 7.4, 3.2, 3.9, …
## $ MP_NOVEH     <dbl> 1.5, 0.5, 2.1, 3.1, 1.3, 5.5, 1.4, 0.7, 1.8, 1.0, 1.2, 2.…
## $ EP_GROUPQ    <dbl> 1.0, 1.4, 11.6, 7.4, 1.0, 4.3, 1.6, 2.6, 1.7, 1.1, 1.0, 1…
## $ MP_GROUPQ    <dbl> 0.3, 0.2, 1.1, 1.0, 0.2, 1.6, 0.7, 0.4, 0.5, 0.6, 0.3, 0.…
## $ EPL_POV150   <dbl> 0.4624, 0.1770, 0.9586, 0.6820, 0.5010, 0.9694, 0.8119, 0…
## $ EPL_UNEMP    <dbl> 0.1397, 0.2861, 0.8071, 0.8485, 0.5586, 0.2005, 0.7689, 0…
## $ EPL_HBURD    <dbl> 0.6038, 0.5808, 0.7075, 0.4831, 0.2651, 0.3628, 0.4055, 0…
## $ EPL_NOHSDP   <dbl> 0.5080, 0.3647, 0.9688, 0.8654, 0.8027, 0.9666, 0.6464, 0…
## $ EPL_UNINSUR  <dbl> 0.4586, 0.5767, 0.6680, 0.5471, 0.6181, 0.6563, 0.6254, 0…
## $ SPL_THEME1   <dbl> 2.1725, 1.9853, 4.1100, 3.4261, 2.7455, 3.1556, 3.2581, 3…
## $ RPL_THEME1   <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.7183, 0.7486, 0…
## $ EPL_AGE65    <dbl> 0.1827, 0.6496, 0.5216, 0.2623, 0.4090, 0.2400, 0.6728, 0…
## $ EPL_AGE17    <dbl> 0.6986, 0.4077, 0.3316, 0.2903, 0.6489, 0.5369, 0.5220, 0…
## $ EPL_DISABL   <dbl> 0.6830, 0.3689, 0.7581, 0.7101, 0.4421, 0.2011, 0.5191, 0…
## $ EPL_SNGPNT   <dbl> 0.7922, 0.1738, 0.9656, 0.9637, 0.6108, 0.9949, 0.7922, 0…
## $ EPL_LIMENG   <dbl> 0.4736, 0.5150, 0.7460, 0.3116, 0.6986, 0.7139, 0.2416, 0…
## $ SPL_THEME2   <dbl> 2.8301, 2.1150, 3.3229, 2.5380, 2.8094, 2.6868, 2.7477, 2…
## $ RPL_THEME2   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.6515, 0.6887, 0…
## $ EPL_MINRTY   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ SPL_THEME3   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ RPL_THEME3   <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.9809, 0.8686, 0…
## $ EPL_MUNIT    <dbl> 0.6050, 0.9574, 0.2565, 0.4402, 0.1187, 0.2909, 0.1983, 0…
## $ EPL_MOBILE   <dbl> 0.7486, 0.4885, 0.9427, 0.9554, 0.8803, 0.9914, 0.8571, 0…
## $ EPL_CROWD    <dbl> 0.4023, 0.3600, 0.8549, 0.2759, 0.4742, 0.1680, 0.1996, 0…
## $ EPL_NOVEH    <dbl> 0.4764, 0.1257, 0.9456, 0.6556, 0.4281, 0.9609, 0.4764, 0…
## $ EPL_GROUPQ   <dbl> 0.1569, 0.3157, 0.9513, 0.8873, 0.1569, 0.7766, 0.3873, 0…
## $ SPL_THEME4   <dbl> 2.3892, 2.2473, 3.9510, 3.2144, 2.0582, 3.1878, 2.1187, 2…
## $ RPL_THEME4   <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.8501, 0.2982, 0…
## $ SPL_THEMES   <dbl> 8.0255, 6.8498, 12.2801, 9.8077, 8.0278, 10.0111, 8.9931,…
## $ RPL_THEMES   <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.8310, 0.6843, 0…
## $ F_POV150     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_UNEMP      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ F_HBURD      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_NOHSDP     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_UNINSUR    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME1     <dbl> 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ F_AGE65      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_AGE17      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_DISABL     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, …
## $ F_SNGPNT     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ F_LIMENG     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME2     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, …
## $ F_MINRTY     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME3     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_MUNIT      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_MOBILE     <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, …
## $ F_CROWD      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_NOVEH      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_GROUPQ     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ F_THEME4     <dbl> 0, 1, 3, 1, 0, 2, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, …
## $ F_TOTAL      <dbl> 0, 1, 6, 2, 0, 6, 0, 0, 1, 0, 1, 2, 2, 1, 1, 0, 0, 2, 2, …
## $ E_DAYPOP     <dbl> 41810, 218607, 27133, 18799, 42172, 7990, 18138, 118672, …
## $ E_NOINT      <dbl> 7100, 24453, 6249, 3839, 8987, 3174, 4683, 18425, 6618, 4…
## $ M_NOINT      <dbl> 913, 2477, 675, 626, 1175, 700, 645, 1731, 745, 589, 844,…
## $ E_AFAM       <dbl> 10849, 19027, 11889, 4971, 771, 6980, 8773, 23669, 13372,…
## $ M_AFAM       <dbl> 345, 744, 179, 111, 167, 140, 76, 380, 127, 120, 420, 79,…
## $ E_HISP       <dbl> 1601, 9947, 1110, 600, 5362, 824, 290, 4406, 843, 432, 34…
## $ M_HISP       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 63, 124, 0, 0, 0, 0, 99,…
## $ E_ASIAN      <dbl> 649, 2033, 122, 56, 236, 137, 261, 919, 368, 26, 195, 25,…
## $ M_ASIAN      <dbl> 174, 352, 25, 81, 41, 117, 41, 186, 50, 35, 55, 32, 27, 3…
## $ E_AIAN       <dbl> 155, 1327, 81, 12, 49, 0, 66, 168, 81, 150, 11, 7, 18, 0,…
## $ M_AIAN       <dbl> 102, 371, 63, 25, 41, 19, 52, 66, 44, 124, 36, 10, 27, 19…
## $ E_NHPI       <dbl> 0, 10, 1, 0, 55, 0, 0, 327, 9, 0, 0, 0, 0, 0, 0, 0, 57, 0…
## $ M_NHPI       <dbl> 29, 16, 2, 23, 60, 19, 19, 135, 16, 23, 26, 19, 23, 19, 1…
## $ E_TWOMORE    <dbl> 1124, 4250, 334, 85, 1038, 68, 211, 2610, 267, 398, 797, …
## $ M_TWOMORE    <dbl> 374, 871, 149, 67, 181, 76, 94, 414, 139, 176, 411, 53, 1…
## $ E_OTHERRACE  <dbl> 101, 740, 157, 0, 179, 2, 0, 176, 74, 0, 170, 0, 47, 5, 1…
## $ M_OTHERRACE  <dbl> 145, 534, 110, 23, 225, 4, 19, 209, 83, 23, 187, 19, 67, …
## $ EP_NOINT     <dbl> 12.9, 11.4, 28.3, 18.5, 15.7, 32.6, 24.1, 16.5, 20.1, 16.…
## $ MP_NOINT     <dbl> 0.0, 0.0, 0.3, 0.2, 0.0, 0.6, 0.2, 0.1, 0.1, 0.1, 0.1, 0.…
## $ EP_AFAM      <dbl> 19.5, 8.7, 47.5, 22.2, 1.3, 68.6, 44.5, 20.7, 40.0, 4.8, …
## $ MP_AFAM      <dbl> 0.6, 0.3, 0.7, 0.5, 0.3, 1.4, 0.4, 0.3, 0.4, 0.5, 1.0, 0.…
## $ EP_HISP      <dbl> 2.9, 4.6, 4.4, 2.7, 9.3, 8.1, 1.5, 3.9, 2.5, 1.7, 7.8, 0.…
## $ MP_HISP      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
## $ EP_ASIAN     <dbl> 1.2, 0.9, 0.5, 0.3, 0.4, 1.3, 1.3, 0.8, 1.1, 0.1, 0.4, 0.…
## $ MP_ASIAN     <dbl> 0.3, 0.2, 0.1, 0.4, 0.1, 1.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.…
## $ EP_AIAN      <dbl> 0.3, 0.6, 0.3, 0.1, 0.1, 0.0, 0.3, 0.1, 0.2, 0.6, 0.0, 0.…
## $ MP_AIAN      <dbl> 0.2, 0.2, 0.3, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.5, 0.1, 0.…
## $ EP_NHPI      <dbl> 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.…
## $ MP_NHPI      <dbl> 0.1, 0.1, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.…
## $ EP_TWOMORE   <dbl> 2.0, 1.9, 1.3, 0.4, 1.8, 0.7, 1.1, 2.3, 0.8, 1.5, 1.8, 0.…
## $ MP_TWOMORE   <dbl> 0.7, 0.4, 0.6, 0.3, 0.3, 0.7, 0.5, 0.4, 0.4, 0.7, 0.9, 0.…
## $ EP_OTHERRACE <dbl> 0.2, 0.3, 0.6, 0.0, 0.3, 0.0, 0.0, 0.2, 0.2, 0.0, 0.4, 0.…
## $ MP_OTHERRACE <dbl> 0.3, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2, 0.2, 0.2, 0.1, 0.4, 0.…

3.2 3.2 Clean SVI Data (Create county_fips, select key variables)

svi_county <- svi_raw %>%
  # standardize column names
  rename_with(tolower) %>%
  
  # create standardized 5-digit county FIPS (same format as FCC)
  mutate(
    county_fips = str_pad(as.character(fips), 5, "left", "0")
  ) %>%
  
  # keep only the variables needed for analysis
  select(
    county_fips,
    state = st_abbr,
    county,
    svi_overall = rpl_themes,
    svi_soc     = rpl_theme1,
    svi_hh      = rpl_theme2,
    svi_min     = rpl_theme3,
    svi_hous    = rpl_theme4
  )

head(svi_county)
## # A tibble: 6 × 8
##   county_fips state county  svi_overall svi_soc svi_hh svi_min svi_hous
##   <chr>       <chr> <chr>         <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
## 1 01001       AL    Autauga       0.513   0.384  0.736   0.634    0.431
## 2 01003       AL    Baldwin       0.310   0.325  0.272   0.502    0.361
## 3 01005       AL    Barbour       0.993   0.957  0.945   0.896    0.995
## 4 01007       AL    Bibb          0.808   0.801  0.551   0.629    0.862
## 5 01009       AL    Blount        0.514   0.576  0.722   0.415    0.274
## 6 01011       AL    Bullock       0.831   0.718  0.652   0.981    0.850
summary(svi_county)
##  county_fips           state              county           svi_overall    
##  Length:3143        Length:3143        Length:3143        Min.   :0.0000  
##  Class :character   Class :character   Class :character   1st Qu.:0.2500  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.4997  
##                                                           Mean   :0.5000  
##                                                           3rd Qu.:0.7500  
##                                                           Max.   :1.0000  
##     svi_soc         svi_hh          svi_min          svi_hous     
##  Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.25   1st Qu.:0.2500   1st Qu.:0.2454   1st Qu.:0.2500  
##  Median :0.50   Median :0.4997   Median :0.4981   Median :0.5000  
##  Mean   :0.50   Mean   :0.5000   Mean   :0.4988   Mean   :0.5000  
##  3rd Qu.:0.75   3rd Qu.:0.7500   3rd Qu.:0.7495   3rd Qu.:0.7499  
##  Max.   :1.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

3.3 3.3 Save Cleaned SVI Dataset

write_csv(svi_county, path_processed("svi_2020_county_clean.csv"))
saveRDS(svi_county, path_processed("svi_2020_county_clean.rds"))

3.4 3.4 SVI Validation and Consistency Checks

After cleaning the SVI data and keeping only the overall index and four theme scores, we validated the SVI variables to ensure they are suitable for analysis. We checked that all SVI values fall within the expected 0–1 percentile range, identified any missing values, and examined correlations between the overall SVI score and each theme.

library(dplyr)

# If not already in memory, load the cleaned SVI
svi_2020_county_clean <- readRDS("processed_data/svi_2020_county_clean.rds")

# Keep only SVI numeric columns for checks
svi_check <- svi_2020_county_clean %>%
  select(
    county_fips,
    svi_overall,
    svi_soc,
    svi_hh,
    svi_min,
    svi_hous
  )

# 1) Range validation: SVI percentiles should be between 0 and 1
summary(svi_check)
##  county_fips         svi_overall        svi_soc         svi_hh      
##  Length:3143        Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  Class :character   1st Qu.:0.2500   1st Qu.:0.25   1st Qu.:0.2500  
##  Mode  :character   Median :0.4997   Median :0.50   Median :0.4997  
##                     Mean   :0.5000   Mean   :0.50   Mean   :0.5000  
##                     3rd Qu.:0.7500   3rd Qu.:0.75   3rd Qu.:0.7500  
##                     Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##     svi_min          svi_hous     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2454   1st Qu.:0.2500  
##  Median :0.4981   Median :0.5000  
##  Mean   :0.4988   Mean   :0.5000  
##  3rd Qu.:0.7495   3rd Qu.:0.7499  
##  Max.   :1.0000   Max.   :1.0000
invalid_svi <- svi_check %>%
  filter(
    svi_overall < 0 | svi_overall > 1 |
    svi_soc     < 0 | svi_soc     > 1 |
    svi_hh      < 0 | svi_hh      > 1 |
    svi_min     < 0 | svi_min     > 1 |
    svi_hous    < 0 | svi_hous    > 1
  )

invalid_svi   # ideally this is an empty tibble
## # A tibble: 0 × 6
## # ℹ 6 variables: county_fips <chr>, svi_overall <dbl>, svi_soc <dbl>,
## #   svi_hh <dbl>, svi_min <dbl>, svi_hous <dbl>
# 2) Missing values check for SVI columns
colSums(is.na(svi_check))
## county_fips svi_overall     svi_soc      svi_hh     svi_min    svi_hous 
##           0           0           0           0           0           0

4 4. Merge FCC and SVI Data

In this section, we combine the cleaned FCC broadband dataset (December 2020) with the cleaned SVI 2020 county dataset using 5-digit county FIPS codes.

4.1 4.1 Load Cleaned FCC and SVI Datasets

fcc_2020 <- readRDS(path_processed(fcc_file))  # Uses configured FCC year
svi_2020 <- readRDS(path_processed("svi_2020_county_clean.rds"))
cat("Loaded FCC data:", fcc_year_label, "-", nrow(fcc_2020), "counties\n")
## Loaded FCC data: May 2025 - 3232 counties
head(fcc_2020)
## # A tibble: 6 × 11
##   county_fips state_name county_name    housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <int> <int> <int>
## 1 01001       AL         Autauga County         27544     4     4     4
## 2 01003       AL         Baldwin County        137072     4     2     2
## 3 01005       AL         Barbour County         15445     4     2     2
## 4 01007       AL         Bibb County            11168     4     1     1
## 5 01009       AL         Blount County          31265     4     1     1
## 6 01011       AL         Bullock County          6136     4     3     1
## # ℹ 4 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>
head(svi_2020)
## # A tibble: 6 × 8
##   county_fips state county  svi_overall svi_soc svi_hh svi_min svi_hous
##   <chr>       <chr> <chr>         <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
## 1 01001       AL    Autauga       0.513   0.384  0.736   0.634    0.431
## 2 01003       AL    Baldwin       0.310   0.325  0.272   0.502    0.361
## 3 01005       AL    Barbour       0.993   0.957  0.945   0.896    0.995
## 4 01007       AL    Bibb          0.808   0.801  0.551   0.629    0.862
## 5 01009       AL    Blount        0.514   0.576  0.722   0.415    0.274
## 6 01011       AL    Bullock       0.831   0.718  0.652   0.981    0.850

4.2 4.2 Identify FCC Counties Without Matching SVI Records

missing_svi <- fcc_2020 %>%
  anti_join(svi_2020, by = "county_fips")

head(missing_svi)
## # A tibble: 6 × 11
##   county_fips state_name county_name       housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                     <dbl> <int> <int> <int>
## 1 60010       AS         Eastern District           3246     4     1     1
## 2 60020       AS         Manu'a District             419     4     1     1
## 3 60040       AS         Swains Island                 1     1     1     1
## 4 60050       AS         Western District           5673     4     1     1
## 5 66010       GU         Guam                      37572     4     3     1
## 6 69100       MP         Rota Municipality           868     4     4     1
## # ℹ 4 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>
nrow(missing_svi)
## [1] 89

The missing_svi table shows FCC counties and territories that appear in the broadband data but do not have corresponding SVI 2020 county records. These are primarily U.S. territories such as Puerto Rico, Guam, American Samoa, and the U.S. Virgin Islands. Because SVI is not available for these areas, they will be excluded from our final merged dataset. ## 4.3 Create Final Merged Dataset (FCC + SVI)

merged_data <- fcc_2020 %>%
  inner_join(svi_2020, by = "county_fips")

# Inspect merged dataset
nrow(merged_data)
## [1] 3143
head(merged_data)
## # A tibble: 6 × 18
##   county_fips state_name county_name    housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <int> <int> <int>
## 1 01001       AL         Autauga County         27544     4     4     4
## 2 01003       AL         Baldwin County        137072     4     2     2
## 3 01005       AL         Barbour County         15445     4     2     2
## 4 01007       AL         Bibb County            11168     4     1     1
## 5 01009       AL         Blount County          31265     4     1     1
## 6 01011       AL         Bullock County          6136     4     3     1
## # ℹ 11 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state <chr>, county <chr>,
## #   svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## #   svi_hous <dbl>
summary(merged_data)
##  county_fips         state_name        county_name        housing_units    
##  Length:3143        Length:3143        Length:3143        Min.   :    221  
##  Class :character   Class :character   Class :character   1st Qu.:   6847  
##  Mode  :character   Mode  :character   Mode  :character   Median :  15075  
##                                                           Mean   :  50401  
##                                                           3rd Qu.:  36756  
##                                                           Max.   :3868643  
##      tier1       tier2           tier3       coverage_25_3    coverage_100_20 
##  Min.   :4   Min.   :1.000   Min.   :1.000   Min.   :0.9996   Min.   :0.0000  
##  1st Qu.:4   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.0000   1st Qu.:0.7074  
##  Median :4   Median :4.000   Median :2.000   Median :1.0000   Median :0.9023  
##  Mean   :4   Mean   :3.104   Mean   :2.498   Mean   :1.0000   Mean   :0.8143  
##  3rd Qu.:4   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.9911  
##  Max.   :4   Max.   :4.000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##  coverage_250_25  coverage_1000_100    state              county         
##  Min.   :0.0000   Min.   :0.0000    Length:3143        Length:3143       
##  1st Qu.:0.5050   1st Qu.:0.0476    Class :character   Class :character  
##  Median :0.7453   Median :0.2287    Mode  :character   Mode  :character  
##  Mean   :0.6748   Mean   :0.3071                                         
##  3rd Qu.:0.8991   3rd Qu.:0.4974                                         
##  Max.   :1.0000   Max.   :1.0000                                         
##   svi_overall        svi_soc         svi_hh          svi_min      
##  Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2500   1st Qu.:0.25   1st Qu.:0.2500   1st Qu.:0.2454  
##  Median :0.4997   Median :0.50   Median :0.4997   Median :0.4981  
##  Mean   :0.5000   Mean   :0.50   Mean   :0.5000   Mean   :0.4988  
##  3rd Qu.:0.7500   3rd Qu.:0.75   3rd Qu.:0.7500   3rd Qu.:0.7495  
##  Max.   :1.0000   Max.   :1.00   Max.   :1.0000   Max.   :1.0000  
##     svi_hous     
##  Min.   :0.0000  
##  1st Qu.:0.2500  
##  Median :0.5000  
##  Mean   :0.5000  
##  3rd Qu.:0.7499  
##  Max.   :1.0000

4.3 4.4 Save Merged Dataset

write_csv(merged_data, path_processed("merged_fcc_svi_2020.csv"))
saveRDS(merged_data, path_processed("merged_fcc_svi_2020.rds"))

5 5. Microsoft Airband Broadband Usage Data

Microsoft’s Airband dataset provides estimates of actual broadband usage based on Windows device telemetry. While FCC data describes where broadband service is available (deployment), Airband data describes where broadband is actually used (adoption). Together, these allow us to compare access (FCC) and usage (Airband) across U.S. counties.

5.1 5.1 Load and Inspect Airband Raw Data

airband_raw <- read_csv(
  path_raw("microsoft", "broadband_data_2020.csv")
)

names(airband_raw)
## [1] "ST"                             "COUNTY ID"                     
## [3] "COUNTY NAME"                    "BROADBAND AVAILABILITY PER FCC"
## [5] "BROADBAND USAGE"
glimpse(airband_raw)
## Rows: 3,142
## Columns: 5
## $ ST                               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "…
## $ `COUNTY ID`                      <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1…
## $ `COUNTY NAME`                    <chr> "Autauga County", "Baldwin County", "…
## $ `BROADBAND AVAILABILITY PER FCC` <chr> "0.8057", "0.8362", "0.6891", "0.3368…
## $ `BROADBAND USAGE`                <chr> "0.391", "0.452", "0.324", "0.136", "…

5.2 5.2 Clean Airband Data

airband_clean <- airband_raw %>%
  # Standardize column names
  rename_with(~ .x |>
                str_trim() |>
                str_replace_all("\\s+", "_") |>
                str_to_lower()) %>%
  # Now expect columns:
  # st, county_id, county_name, broadband_availability_per_fcc, broadband_usage
  mutate(
    county_fips = str_pad(as.character(county_id), 5, pad = "0")
  ) %>%
  select(
    county_fips,
    state = st,
    county_name,
    airband_fcc_availability = broadband_availability_per_fcc,
    airband_usage = broadband_usage
  )

# Inspect
names(airband_clean)
## [1] "county_fips"              "state"                   
## [3] "county_name"              "airband_fcc_availability"
## [5] "airband_usage"
head(airband_clean)
## # A tibble: 6 × 5
##   county_fips state county_name    airband_fcc_availability airband_usage
##   <chr>       <chr> <chr>          <chr>                    <chr>        
## 1 01001       AL    Autauga County 0.8057                   0.391        
## 2 01003       AL    Baldwin County 0.8362                   0.452        
## 3 01005       AL    Barbour County 0.6891                   0.324        
## 4 01007       AL    Bibb County    0.3368                   0.136        
## 5 01009       AL    Blount County  0.758                    0.199        
## 6 01011       AL    Bullock County 0.9363                   0.157
summary(airband_clean$airband_usage)
##    Length     Class      Mode 
##      3142 character character

5.3 5.3 Airband Validation Checks

# Load cleaned Microsoft Airband dataset
airband_2020 <- readRDS("processed_data/airband_2020_county_clean.rds")

# Overview of structure
glimpse(airband_2020)
## Rows: 3,142
## Columns: 5
## $ county_fips              <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state                    <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ airband_fcc_availability <chr> "0.8057", "0.8362", "0.6891", "0.3368", "0.75…
## $ airband_usage            <chr> "0.391", "0.452", "0.324", "0.136", "0.199", …
# Use county_fips (not FIPS)
airband_check <- airband_2020 %>%
  select(
    county_fips,
    airband_fcc_availability,
    airband_usage
  )

# Convert availability and usage to numeric if stored as characters
airband_check <- airband_check %>%
  mutate(
    airband_fcc_availability = as.numeric(airband_fcc_availability),
    airband_usage            = as.numeric(airband_usage)
  )

# 1) Summary statistics
summary(airband_check)
##  county_fips        airband_fcc_availability airband_usage   
##  Length:3142        Min.   :0.0002           Min.   :0.0010  
##  Class :character   1st Qu.:0.7700           1st Qu.:0.1960  
##  Mode  :character   Median :0.9242           Median :0.3680  
##                     Mean   :0.8404           Mean   :0.3896  
##                     3rd Qu.:0.9839           3rd Qu.:0.5660  
##                     Max.   :1.0000           Max.   :1.0000  
##                     NA's   :9                NA's   :11
# 2) Range validation
invalid_airband_range <- airband_check %>%
  filter(
    airband_fcc_availability < 0 | airband_fcc_availability > 1 |
    airband_usage < 0 | airband_usage > 1
  )

invalid_airband_range   # ideally 0 rows
## # A tibble: 0 × 3
## # ℹ 3 variables: county_fips <chr>, airband_fcc_availability <dbl>,
## #   airband_usage <dbl>
# 3) Missingness check
colSums(is.na(airband_check))
##              county_fips airband_fcc_availability            airband_usage 
##                        0                        9                       11
# 4) Consistency check: usage > availability
airband_usage_gt_avail <- airband_check %>%
  filter(
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    airband_usage > airband_fcc_availability
  )

# Show examples
airband_usage_gt_avail %>% head()
## # A tibble: 6 × 3
##   county_fips airband_fcc_availability airband_usage
##   <chr>                          <dbl>         <dbl>
## 1 01063                         0.0085         0.013
## 2 01105                         0.0012         0.032
## 3 02185                         0.0061         0.051
## 4 02188                         0.0081         0.039
## 5 02198                         0.0282         0.04 
## 6 02290                         0.013          0.139
nrow(airband_usage_gt_avail)
## [1] 50

Airband Data Validation Summary

The Microsoft Airband dataset was validated after cleaning and aggregation to the county level. Both broadband indicators—modeled FCC availability and estimated household usage—fall entirely within the expected 0–1 range, with missing values limited to counties in Alaska and U.S. territories where Airband suppresses estimates. A subset of counties exhibit usage values that exceed modeled availability. This behavior is well-documented in the Airband methodology and reflects differences between modeled infrastructure coverage and actual adoption (including satellite-based broadband). These records were retained, as they represent genuine modeling differences rather than data errors. (# Need to check this)

5.4 5.4 Save Cleaned Airband Dataset

write_csv(airband_clean, path_processed("airband_2020_county_clean.csv"))
saveRDS(airband_clean, path_processed("airband_2020_county_clean.rds"))

6 5.5 Ookla Speedtest Data (2020)

Ookla provides actual measured internet performance data from Speedtest users. Unlike FCC availability data (which measures where broadband could be offered) or Microsoft Airband usage data, Ookla data captures real-world speeds experienced by consumers. This includes download/upload speeds and latency.

The raw Ookla data is tile-based (millions of geographic tiles). We pre-aggregated this to county level using spatial joins, creating weighted averages by number of speed tests.

6.1 5.5.1 Load Ookla County-Level Data

# Load pre-aggregated Ookla data (created by scripts/aggregate_ookla_to_county.R)
ookla_2020 <- readRDS(path_processed("ookla_2020_county_clean.rds"))

# Inspect structure
glimpse(ookla_2020)
## Rows: 3,124
## Columns: 9
## $ county_fips          <chr> "01001", "01003", "01005", "01007", "01009", "010…
## $ avg_download_mbps    <dbl> 139.27879, 105.00465, 99.15987, 37.20531, 56.9981…
## $ avg_upload_mbps      <dbl> 66.695142, 55.552737, 11.026225, 22.346111, 23.61…
## $ avg_latency_ms       <dbl> 22.40162, 50.51657, 55.41985, 156.95415, 35.75194…
## $ median_download_mbps <dbl> 111.8190, 45.4395, 36.9190, 16.5380, 23.6010, 15.…
## $ median_upload_mbps   <dbl> 15.2790, 11.1445, 6.9430, 3.2920, 4.3810, 3.0405,…
## $ total_tests          <int> 3095, 15144, 262, 458, 3729, 187, 568, 3224, 900,…
## $ total_devices        <int> 833, 5026, 94, 156, 903, 39, 139, 1233, 211, 342,…
## $ n_tiles              <int> 325, 1860, 70, 117, 611, 36, 96, 716, 156, 268, 2…
head(ookla_2020)
## # A tibble: 6 × 9
##   county_fips avg_download_mbps avg_upload_mbps avg_latency_ms
##   <chr>                   <dbl>           <dbl>          <dbl>
## 1 01001                   139.            66.7            22.4
## 2 01003                   105.            55.6            50.5
## 3 01005                    99.2           11.0            55.4
## 4 01007                    37.2           22.3           157. 
## 5 01009                    57.0           23.6            35.8
## 6 01011                    20.6            7.40           41.0
## # ℹ 5 more variables: median_download_mbps <dbl>, median_upload_mbps <dbl>,
## #   total_tests <int>, total_devices <int>, n_tiles <int>

6.2 5.5.2 Ookla Data Summary

# Summary statistics
cat("=== OOKLA SPEEDTEST SUMMARY (2020) ===\n\n")
## === OOKLA SPEEDTEST SUMMARY (2020) ===
cat("Counties with data:", nrow(ookla_2020), "\n\n")
## Counties with data: 3124
cat("Download Speed (Mbps):\n")
## Download Speed (Mbps):
print(summary(ookla_2020$avg_download_mbps))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.448  42.141  76.151  80.142 113.324 339.268
cat("\nUpload Speed (Mbps):\n")
## 
## Upload Speed (Mbps):
print(summary(ookla_2020$avg_upload_mbps))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.217   9.666  16.013  26.432  33.977 200.109
cat("\nLatency (ms):\n")
## 
## Latency (ms):
print(summary(ookla_2020$avg_latency_ms))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.70   25.61   40.27   56.56   64.41  745.39
# Distribution plots
library(ggplot2)

p1 <- ggplot(ookla_2020, aes(x = avg_download_mbps)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  labs(title = "Distribution of County-Level Download Speeds",
       x = "Average Download Speed (Mbps)", y = "Count") +
  theme_minimal()

p2 <- ggplot(ookla_2020, aes(x = avg_latency_ms)) +
  geom_histogram(bins = 50, fill = "coral", alpha = 0.7) +
  labs(title = "Distribution of County-Level Latency",
       x = "Average Latency (ms)", y = "Count") +
  theme_minimal()

print(p1)

print(p2)

6.3 5.5.3 Ookla Validation Checks

# Check for reasonable ranges
invalid_ookla <- ookla_2020 %>%
  filter(
    avg_download_mbps < 0 | avg_download_mbps > 1000 |
    avg_upload_mbps < 0 | avg_upload_mbps > 500 |
    avg_latency_ms < 0 | avg_latency_ms > 1000
  )

cat("Invalid Ookla records:", nrow(invalid_ookla), "\n")
## Invalid Ookla records: 0
# Check sample sizes
cat("\nSample size distribution (tests per county):\n")
## 
## Sample size distribution (tests per county):
print(summary(ookla_2020$total_tests))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     331    1061    6869    3484  938240
# Counties with low sample sizes
low_sample <- ookla_2020 %>% filter(total_tests < 100)
cat("\nCounties with <100 tests:", nrow(low_sample), "\n")
## 
## Counties with <100 tests: 281

7 5.6 USDA Rural-Urban Continuum Codes (2023)

The USDA Economic Research Service provides Rural-Urban Continuum Codes (RUCC) that classify counties by their degree of urbanization and proximity to metropolitan areas. Codes range from 1 (most urban/metro) to 9 (most rural/remote). This classification is essential for understanding the rural-urban dimension of the digital divide.

7.1 5.6.1 Load USDA RUCC Data

# Load pre-processed USDA data (created by scripts/process_usda_rucc.R)
usda_rucc <- readRDS(path_processed("usda_rucc_2023_clean.rds"))

cat("USDA RUCC 2023 Data:\n")
## USDA RUCC 2023 Data:
cat("Counties:", nrow(usda_rucc), "\n\n")
## Counties: 3235
glimpse(usda_rucc)
## Rows: 3,235
## Columns: 9
## $ county_fips      <chr> "01001", "01003", "01005", "01007", "01009", "01011",…
## $ state_abbr       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ county_name      <chr> "Autauga County", "Baldwin County", "Barbour County",…
## $ population_2020  <dbl> 58805, 231767, 25223, 22293, 59134, 10357, 19051, 116…
## $ rucc_2023        <int> 2, 3, 6, 1, 1, 8, 6, 3, 6, 8, 1, 9, 9, 9, 8, 4, 3, 9,…
## $ rucc_description <chr> "Metro - Counties in metro areas of 250,000 to 1 mill…
## $ is_metro         <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FA…
## $ is_rural         <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, …
## $ rural_urban_cat  <chr> "Metro", "Metro", "Nonmetro-Adjacent", "Metro", "Metr…

7.2 5.6.2 RUCC Distribution

cat("RUCC Code Distribution:\n")
## RUCC Code Distribution:
print(table(usda_rucc$rucc_2023))
## 
##   1   2   3   4   5   6   7   8   9 
## 483 398 371 204  81 385 248 468 595
cat("\nRural-Urban Category:\n")
## 
## Rural-Urban Category:
print(table(usda_rucc$rural_urban_cat))
## 
##                Metro    Nonmetro-Adjacent Nonmetro-Nonadjacent 
##                 1252                  670                 1313
# Visualization
ggplot(usda_rucc, aes(x = factor(rucc_2023), fill = rural_urban_cat)) +
  geom_bar() +
  labs(title = "Distribution of USDA Rural-Urban Continuum Codes",
       x = "RUCC Code (1=Most Urban, 9=Most Rural)",
       y = "Number of Counties",
       fill = "Category") +
  scale_fill_manual(values = c("Metro" = "steelblue",
                               "Nonmetro-Adjacent" = "orange",
                               "Nonmetro-Nonadjacent" = "darkred")) +
  theme_minimal()

7.3 5.6.3 RUCC Code Definitions

# Show unique descriptions for each RUCC code
rucc_defs <- usda_rucc %>%
  select(rucc_2023, rucc_description) %>%
  distinct() %>%
  arrange(rucc_2023)

print(rucc_defs)
## # A tibble: 10 × 2
##    rucc_2023 rucc_description                                                   
##        <int> <chr>                                                              
##  1         1 Metro - Counties in metro areas of 1 million population or more    
##  2         2 Metro - Counties in metro areas of 250,000 to 1 million population 
##  3         3 Metro - Counties in metro areas of fewer than 250,000 population   
##  4         4 Nonmetro - Urban population of 20,000 or more, adjacent to a metro…
##  5         5 Nonmetro - Urban population of 20,000 or more, not adjacent to a m…
##  6         6 Nonmetro - Urban population of 5,000 to 20,000, adjacent to a metr…
##  7         7 Nonmetro - Urban population of 5,000 to 20,000, not adjacent to a …
##  8         8 Nonmetro - Urban population of fewer than 5,000, adjacent to a met…
##  9         9 Nonmetro - Urban population of fewer than 5,000, not adjacent to a…
## 10        NA Not Applicable

8 6. Final Master Dataset Merge

8.1 6.1 Load Cleaned FCC, SVI, Airband, Ookla, and USDA Datasets

fcc_2020      <- readRDS(path_processed(fcc_file))  # Uses configured FCC year
svi_2020      <- readRDS(path_processed("svi_2020_county_clean.rds"))
airband_clean <- readRDS(path_processed("airband_2020_county_clean.rds"))
ookla_2020    <- readRDS(path_processed("ookla_2020_county_clean.rds"))
usda_rucc     <- readRDS(path_processed("usda_rucc_2023_clean.rds"))

# Row counts for confirmation
cat("FCC data:", fcc_year_label, "-", nrow(fcc_2020), "counties\n")
## FCC data: May 2025 - 3232 counties
cat("SVI counties:", nrow(svi_2020), "\n")
## SVI counties: 3143
cat("Airband counties:", nrow(airband_clean), "\n")
## Airband counties: 3142
cat("Ookla counties:", nrow(ookla_2020), "\n")
## Ookla counties: 3124
cat("USDA RUCC counties:", nrow(usda_rucc), "\n")
## USDA RUCC counties: 3235

8.2 6.2 Identify Counties Missing Airband Data

missing_airband <- merged_data %>%
  anti_join(airband_clean, by = "county_fips")

head(missing_airband)
## # A tibble: 2 × 18
##   county_fips state_name county_name             housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                           <dbl> <int> <int> <int>
## 1 02063       AK         Chugach Census Area              4074     4     2     1
## 2 02066       AK         Copper River Census Ar…          3898     4     4     2
## # ℹ 11 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state <chr>, county <chr>,
## #   svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## #   svi_hous <dbl>
nrow(missing_airband)
## [1] 2

8.3 6.3 Create Final Master Dataset (FCC + SVI + Airband + Ookla + USDA)

# Merge Airband
merged_all <- merged_data %>%
  inner_join(airband_clean, by = "county_fips")

# Merge Ookla (left join to keep all counties even if no Ookla data)
merged_all <- merged_all %>%
  left_join(ookla_2020, by = "county_fips")

# Merge USDA RUCC (left join, select only key variables to avoid name conflicts)
usda_for_merge <- usda_rucc %>%
  select(county_fips, rucc_2023, is_metro, is_rural, rural_urban_cat)

merged_all <- merged_all %>%
  left_join(usda_for_merge, by = "county_fips")

# Inspect final dataset
cat("Final merged dataset rows:", nrow(merged_all), "\n")
## Final merged dataset rows: 3141
cat("Counties with Ookla data:", sum(!is.na(merged_all$avg_download_mbps)), "\n")
## Counties with Ookla data: 3122
cat("Counties with USDA RUCC:", sum(!is.na(merged_all$rucc_2023)), "\n\n")
## Counties with USDA RUCC: 3133
# Rural-urban breakdown
cat("Metro vs Nonmetro in merged data:\n")
## Metro vs Nonmetro in merged data:
print(table(merged_all$rural_urban_cat, useNA = "ifany"))
## 
##                Metro    Nonmetro-Adjacent Nonmetro-Nonadjacent 
##                 1179                  654                 1300 
##                 <NA> 
##                    8
head(merged_all)
## # A tibble: 6 × 34
##   county_fips state_name county_name.x  housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <int> <int> <int>
## 1 01001       AL         Autauga County         27544     4     4     4
## 2 01003       AL         Baldwin County        137072     4     2     2
## 3 01005       AL         Barbour County         15445     4     2     2
## 4 01007       AL         Bibb County            11168     4     1     1
## 5 01009       AL         Blount County          31265     4     1     1
## 6 01011       AL         Bullock County          6136     4     3     1
## # ℹ 27 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state.x <chr>,
## #   county <chr>, svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>,
## #   svi_min <dbl>, svi_hous <dbl>, state.y <chr>, county_name.y <chr>,
## #   airband_fcc_availability <chr>, airband_usage <chr>,
## #   avg_download_mbps <dbl>, avg_upload_mbps <dbl>, avg_latency_ms <dbl>,
## #   median_download_mbps <dbl>, median_upload_mbps <dbl>, total_tests <int>, …

8.4 6.4 Save Final Merged Dataset

# Save with updated name reflecting all integrations
write_csv(merged_all, path_processed("merged_master_2020.csv"))
saveRDS(merged_all, path_processed("merged_master_2020.rds"))

# Also save to original names for backwards compatibility
write_csv(merged_all, path_processed("merged_fcc_svi_airband_ookla_2020.csv"))
saveRDS(merged_all, path_processed("merged_fcc_svi_airband_ookla_2020.rds"))
write_csv(merged_all, path_processed("merged_fcc_svi_airband_2020.csv"))
saveRDS(merged_all, path_processed("merged_fcc_svi_airband_2020.rds"))

cat("Saved merged dataset with", ncol(merged_all), "columns\n")
## Saved merged dataset with 34 columns

9 6.5 Enhanced Visualizations: Digital Divide Analysis

This section provides comprehensive visualizations exploring the digital divide across rural-urban classifications, income levels, and geographic regions.

9.1 6.5.1 Rural-Urban Digital Divide: Internet Speeds

# Prepare data with RUCC categories
viz_data <- merged_all %>%
  filter(!is.na(rural_urban_cat), !is.na(avg_download_mbps))

# Box plot: Download speeds by rural-urban category
p1 <- ggplot(viz_data, aes(x = rural_urban_cat, y = avg_download_mbps, fill = rural_urban_cat)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("Metro" = "#2E86AB",
                               "Nonmetro-Adjacent" = "#F6AE2D",
                               "Nonmetro-Nonadjacent" = "#E94F37")) +
  labs(title = "Download Speeds by Rural-Urban Classification",
       subtitle = "Ookla Speedtest Data (2020)",
       x = "", y = "Average Download Speed (Mbps)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", size = 14))

# Box plot: Latency by rural-urban category
p2 <- ggplot(viz_data, aes(x = rural_urban_cat, y = avg_latency_ms, fill = rural_urban_cat)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("Metro" = "#2E86AB",
                               "Nonmetro-Adjacent" = "#F6AE2D",
                               "Nonmetro-Nonadjacent" = "#E94F37")) +
  labs(title = "Network Latency by Rural-Urban Classification",
       subtitle = "Higher latency = slower response times",
       x = "", y = "Average Latency (ms)") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", size = 14))

gridExtra::grid.arrange(p1, p2, ncol = 2)

9.2 6.5.2 Speed Distribution Density by RUCC Code

viz_data2 <- merged_all %>%
  filter(!is.na(rucc_2023), !is.na(avg_download_mbps)) %>%
  mutate(rucc_label = paste0("RUCC ", rucc_2023))

ggplot(viz_data2, aes(x = avg_download_mbps, fill = factor(rucc_2023))) +
  geom_density(alpha = 0.5) +
  scale_fill_viridis_d(name = "RUCC Code", option = "plasma") +
  labs(title = "Download Speed Distribution by RUCC Code",
       subtitle = "RUCC 1-3: Metro | RUCC 4-6: Nonmetro Adjacent | RUCC 7-9: Remote Rural",
       x = "Average Download Speed (Mbps)", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14)) +
  xlim(0, 200)

9.3 6.5.3 FCC Coverage vs Actual Speeds (Ookla)

# Compare FCC reported coverage with actual measured speeds
viz_data3 <- merged_all %>%
  filter(!is.na(coverage_25_3), !is.na(avg_download_mbps))

if (nrow(viz_data3) > 0) {
  ggplot(viz_data3, aes(x = coverage_25_3 * 100, y = avg_download_mbps, color = rural_urban_cat)) +
    geom_point(alpha = 0.5, size = 2) +
    geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
    scale_color_manual(values = c("Metro" = "#2E86AB",
                                  "Nonmetro-Adjacent" = "#F6AE2D",
                                  "Nonmetro-Nonadjacent" = "#E94F37")) +
    labs(title = "FCC Reported Coverage vs Actual Measured Speeds",
         subtitle = "Does high coverage translate to fast speeds?",
         x = "FCC 25/3 Mbps Coverage (%)",
         y = "Actual Avg Download Speed (Mbps)",
         color = "Area Type") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", size = 14),
          legend.position = "bottom")
} else {
  cat("Note: Coverage data not available for this comparison\n")
}

9.4 6.5.4 Social Vulnerability and Broadband Access

viz_data4 <- merged_all %>%
  filter(!is.na(svi_overall), !is.na(avg_download_mbps))

# SVI vs Download Speed
p1 <- ggplot(viz_data4, aes(x = svi_overall, y = avg_download_mbps)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", linewidth = 1.2) +
  labs(title = "Social Vulnerability vs Internet Speed",
       x = "SVI Overall Percentile (higher = more vulnerable)",
       y = "Avg Download Speed (Mbps)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12))

# SVI vs Latency
p2 <- ggplot(viz_data4, aes(x = svi_overall, y = avg_latency_ms)) +
  geom_point(alpha = 0.3, color = "coral") +
  geom_smooth(method = "lm", color = "darkred", linewidth = 1.2) +
  labs(title = "Social Vulnerability vs Network Latency",
       x = "SVI Overall Percentile",
       y = "Avg Latency (ms)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12))

gridExtra::grid.arrange(p1, p2, ncol = 2)

9.5 6.5.5 Heatmap: Correlations Between Key Variables

# Select numeric variables for correlation
cor_vars <- merged_all %>%
  select(
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    tier2, tier3,
    avg_download_mbps, avg_upload_mbps, avg_latency_ms,
    airband_usage, rucc_2023
  ) %>%
  select(where(is.numeric)) %>%
  na.omit()  # Remove rows with NA for clean correlation

# Compute correlation matrix
cor_matrix <- cor(cor_vars, use = "complete.obs")

# Plot correlation heatmap (removed hclust ordering to avoid NA issues)
corrplot::corrplot(cor_matrix,
                   method = "color",
                   type = "upper",
                   order = "original",
                   tl.col = "black",
                   tl.srt = 45,
                   addCoef.col = "black",
                   number.cex = 0.7,
                   col = corrplot::COL2('RdBu', 10),
                   title = "Correlation Matrix: Digital Divide Indicators",
                   mar = c(0, 0, 2, 0))

9.6 6.5.6 Speed Tiers: Metro vs Rural Comparison

# Create speed tier categories from actual Ookla speeds
viz_data5 <- merged_all %>%
  filter(!is.na(avg_download_mbps), !is.na(rural_urban_cat)) %>%
  mutate(
    speed_tier = case_when(
      avg_download_mbps < 25 ~ "Below Basic (<25 Mbps)",
      avg_download_mbps < 100 ~ "Basic (25-100 Mbps)",
      avg_download_mbps < 250 ~ "Fast (100-250 Mbps)",
      TRUE ~ "Very Fast (250+ Mbps)"
    ),
    speed_tier = factor(speed_tier, levels = c("Below Basic (<25 Mbps)",
                                                "Basic (25-100 Mbps)",
                                                "Fast (100-250 Mbps)",
                                                "Very Fast (250+ Mbps)"))
  )

# Stacked bar chart
ggplot(viz_data5, aes(x = rural_urban_cat, fill = speed_tier)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Below Basic (<25 Mbps)" = "#E94F37",
                               "Basic (25-100 Mbps)" = "#F6AE2D",
                               "Fast (100-250 Mbps)" = "#7FB069",
                               "Very Fast (250+ Mbps)" = "#2E86AB")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Distribution of Actual Internet Speeds by Area Type",
       subtitle = "Based on Ookla Speedtest measurements",
       x = "", y = "Proportion of Counties",
       fill = "Speed Tier") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "bottom") +
  coord_flip()

9.7 6.5.7 Summary Statistics Table by Rural-Urban Category

summary_stats <- merged_all %>%
  filter(!is.na(rural_urban_cat)) %>%
  group_by(rural_urban_cat) %>%
  summarise(
    n_counties = n(),
    avg_download = round(mean(avg_download_mbps, na.rm = TRUE), 1),
    median_download = round(median(avg_download_mbps, na.rm = TRUE), 1),
    avg_latency = round(mean(avg_latency_ms, na.rm = TRUE), 1),
    avg_svi = round(mean(svi_overall, na.rm = TRUE), 3),
    pct_below_25mbps = round(100 * mean(avg_download_mbps < 25, na.rm = TRUE), 1),
    .groups = "drop"
  )

cat("\n=== DIGITAL DIVIDE SUMMARY BY AREA TYPE ===\n\n")
## 
## === DIGITAL DIVIDE SUMMARY BY AREA TYPE ===
print(summary_stats)
## # A tibble: 3 × 7
##   rural_urban_cat    n_counties avg_download median_download avg_latency avg_svi
##   <chr>                   <int>        <dbl>           <dbl>       <dbl>   <dbl>
## 1 Metro                    1179        110.            114.         43.1   0.476
## 2 Nonmetro-Adjacent         654         82.1            80.6        52.4   0.58 
## 3 Nonmetro-Nonadjac…       1300         51.8            43.5        71.3   0.482
## # ℹ 1 more variable: pct_below_25mbps <dbl>
# Visual table using knitr if available
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(summary_stats,
               col.names = c("Area Type", "Counties", "Avg DL (Mbps)",
                            "Median DL", "Avg Latency (ms)", "Avg SVI", "% Below 25Mbps"),
               caption = "Digital Divide Metrics by Rural-Urban Classification")
}
Digital Divide Metrics by Rural-Urban Classification
Area Type Counties Avg DL (Mbps) Median DL Avg Latency (ms) Avg SVI % Below 25Mbps
Metro 1179 109.5 114.2 43.1 0.476 3.9
Nonmetro-Adjacent 654 82.1 80.6 52.4 0.580 2.8
Nonmetro-Nonadjacent 1300 51.8 43.5 71.3 0.482 25.1

9.8 6.5.8 Geographic Distribution: RUCC Codes

# Note: This chunk creates a US map colored by RUCC codes
# Set eval=TRUE to generate (requires counties_sf shapefile)

# Load shapefile if available
if (file.exists(path_processed("counties_sf.rds"))) {
  counties_sf <- readRDS(path_processed("counties_sf.rds"))

  # Join with RUCC data
  map_data <- counties_sf %>%
    left_join(merged_all %>% select(county_fips, rucc_2023, rural_urban_cat),
              by = c("GEOID" = "county_fips"))

  # Filter to continental US
  map_data <- map_data %>%
    filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))

  ggplot(map_data) +
    geom_sf(aes(fill = rural_urban_cat), color = NA) +
    scale_fill_manual(values = c("Metro" = "#2E86AB",
                                 "Nonmetro-Adjacent" = "#F6AE2D",
                                 "Nonmetro-Nonadjacent" = "#E94F37"),
                      na.value = "gray90") +
    labs(title = "Rural-Urban Classification of US Counties",
         subtitle = "USDA Rural-Urban Continuum Codes (2023)",
         fill = "Category") +
    theme_void() +
    theme(plot.title = element_text(face = "bold", size = 16),
          legend.position = "bottom")
}

10 7. County-Level Shapefile Processing

To conduct spatial analysis and merge all county-level datasets consistently, we first processed the 2020 U.S. Census TIGER/Line county shapefile. This shapefile provides the geographic boundary polygons and the official county FIPS codes (GEOID) used as the primary spatial join key in later stages of the project.

11 7.1 Load Libraries for Spatial Data

sf::sf_use_s2(FALSE)

12 7.2 Read the County Shapefile

We now load the 2020 U.S. Census TIGER/Line county shapefile into R using the sf package. This file contains county boundary polygons and the GEOID field that will be used later for spatial joining with ACS, SVI, FCC, and Airband datasets.

shp_path <- "raw_data/geographic/tl_2020_us_county.shp"

counties_raw <- sf::st_read(shp_path)
## Reading layer `tl_2020_us_county' from data source 
##   `/Users/yogasundaramramaswamy/Downloads/digital_divide_project/raw_data/geographic/tl_2020_us_county.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 3234 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  NAD83

#7.3 Select Relevant Columns

The county shapefile contains many fields, but for this project we only need the attributes required for merging and mapping. Specifically, we keep:

GEOID – 5-digit county FIPS (primary join key)

NAME – county name

STATEFP – state FIPS code

COUNTYFP – county number within the state

geometry – polygon boundary information

counties_sf <- counties_raw %>% 
  dplyr::select(GEOID, NAME, STATEFP, COUNTYFP, geometry)

counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  NAD83
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...

13 7.4 Standardize the GEOID (County FIPS Code)

Different datasets (SVI, ACS, FCC, Airband) use FIPS codes to identify counties. To ensure all datasets merge correctly, we convert the GEOID column from the shapefile into a 5-digit character string with leading zeros.

Example:

1001 → 01001

3501 → 03501

This prevents merge errors and ensures consistency across all data sources.

counties_sf <- counties_sf %>% 
  mutate(
    GEOID = as.character(GEOID),              # convert to character
    GEOID = stringr::str_pad(GEOID,           # pad with leading zeros
                             width = 5, 
                             side = "left", 
                             pad = "0")
  )

# Display first 20 FIPS codes to verify correctness
head(counties_sf$GEOID, 20)
##  [1] "31039" "53069" "35011" "31109" "31129" "72085" "46099" "48327" "06091"
## [10] "21053" "39063" "48189" "01027" "48011" "39003" "13189" "55111" "05137"
## [19] "41063" "42007"

14 7.5 Re-project County Shapefile for Mapping

The TIGER/Line county shapefile is in the NAD83 coordinate reference system. For most visualization tools (e.g., ggplot2, tmap, leaflet), it is convenient to use WGS84 (EPSG:4326). Here, we reproject the spatial object once so it is ready for mapping and spatial analysis.

counties_sf <- st_transform(counties_sf, crs = 4326)

# Quick check of the CRS and object
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...
st_crs(counties_sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

15 7.6 Save Cleaned County Shapefile

# Save Cleaned County Spatial Data as RDS

# Save directly into your existing processed_data folder
saveRDS(counties_sf, "processed_data/counties_sf.rds")

# Confirm it saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"

16 8. Processing ACS (American Community Survey) Data

The American Community Survey (ACS) provides essential socioeconomic indicators that are critical for understanding digital inequality across U.S. counties. Unlike datasets that only measure broadband availability or usage, ACS tables contain detailed information about household income, educational attainment, and internet/computer access—all of which are deeply connected to digital divide outcomes.

17 8.1 Clean ACS B28002 (Internet & Computer Use)

The ACS table B28002 provides detailed information about internet and computer access at the county level. This dataset is essential for understanding digital adoption, device availability, and technology access across communities. However, the raw ACS download includes extra columns—such as margins of error (M columns), metadata, and a GEO_ID field containing long identifiers.

# Load ACS B28002
b28002_path <- "raw_data/census/ACSDT5Y2020.B28002-Data.csv"
acs_b28002_raw <- read_csv(b28002_path)

# Remove label row
acs_b28002_raw <- acs_b28002_raw %>% 
  filter(GEO_ID != "Geography")

# Keep only important columns
acs_b28002_clean <- acs_b28002_raw %>% 
  select(
    GEO_ID, NAME,
    B28002_001E,  # total households
    B28002_002E,  # broadband households
    B28002_012E,  # households with no internet
    B28002_013E   # households with no computer
    # OPTIONAL: add B28002_004E, _005E, _006E if needed
  ) %>%
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>%
  select(FIPS, GEO_ID, NAME, everything())

# Preview
head(acs_b28002_clean)
## # A tibble: 6 × 7
##   FIPS  GEO_ID         NAME      B28002_001E B28002_002E B28002_012E B28002_013E
##   <chr> <chr>          <chr>     <chr>       <chr>       <chr>       <chr>      
## 1 01001 0500000US01001 Autauga … 21559       17850       402         3307       
## 2 01003 0500000US01003 Baldwin … 84047       71880       2149        10018      
## 3 01005 0500000US01005 Barbour … 9322        6059        563         2700       
## 4 01007 0500000US01007 Bibb Cou… 7259        5529        210         1520       
## 5 01009 0500000US01009 Blount C… 21205       16971       314         3920       
## 6 01011 0500000US01011 Bullock … 3429        2153        236         1040

18 8.2 Save Cleaned B28002 Dataset

# Save to your existing processed_data folder
write_csv(acs_b28002_clean, "processed_data/acs_b28002_2020_county_clean.csv")
saveRDS(acs_b28002_clean,  "processed_data/acs_b28002_2020_county_clean.rds")

# Confirm that the files saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"

19 8.3 Clean ACS B19013 (Median Household Income)

ACS table B19013 contains the median household income for each U.S. county. Income is one of the strongest predictors of digital access, affordability, and adoption — making it a critical variable in your digital divide analysis.

Because the raw ACS file includes:

metadata rows

margin of error (_M) columns

long GEO_ID fields

We must clean and standardize the dataset before merging it with other county-level data.

# 8.2 Load raw ACS B19013 data
b19013_path <- "raw_data/census/ACSDT5Y2020.B19013-Data.csv"
acs_b19013_raw <- read_csv(b19013_path)

# Remove the label row
acs_b19013_raw <- acs_b19013_raw %>% 
  filter(GEO_ID != "Geography")

# Keep only GEO_ID, NAME, and the income estimate column
acs_b19013_clean <- acs_b19013_raw %>% 
  select(
    GEO_ID,
    NAME,
    B19013_001E   # median household income
  ) %>% 
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>% 
  select(FIPS, GEO_ID, NAME, everything())

# Preview cleaned file
head(acs_b19013_clean)
## # A tibble: 6 × 4
##   FIPS  GEO_ID         NAME                    B19013_001E
##   <chr> <chr>          <chr>                   <chr>      
## 1 01001 0500000US01001 Autauga County, Alabama 57982      
## 2 01003 0500000US01003 Baldwin County, Alabama 61756      
## 3 01005 0500000US01005 Barbour County, Alabama 34990      
## 4 01007 0500000US01007 Bibb County, Alabama    51721      
## 5 01009 0500000US01009 Blount County, Alabama  48922      
## 6 01011 0500000US01011 Bullock County, Alabama 33866

20 8.4 Save the cleaned Median Income File

write_csv(acs_b19013_clean, "processed_data/acs_b19013_2020_county_clean.csv")
saveRDS(acs_b19013_clean,  "processed_data/acs_b19013_2020_county_clean.rds")

# Confirm saved files
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"

21 8.5 Clean ACS B15003 (Educational Attainment)

ACS table B15003 provides counts of the population by educational attainment level. For digital divide research, education is one of the most important predictors of digital literacy, employment opportunities, and broadband adoption.

library(dplyr)
library(readr)
library(stringr)

# 8.3 Load raw ACS B15003 education data
b15003_path <- "raw_data/census/ACSDT5Y2020.B15003-Data.csv"
acs_b15003_raw <- read_csv(b15003_path)

# Remove descriptive header row
acs_b15003_raw <- acs_b15003_raw %>% 
  filter(GEO_ID != "Geography")

# Keep key educational variables
acs_b15003_clean <- acs_b15003_raw %>% 
  select(
    GEO_ID,
    NAME,
    B15003_001E,  # total population age 25+
    B15003_017E,  # high school graduate
    B15003_022E,  # bachelor's degree
    B15003_023E,  # master's degree
    B15003_025E   # doctorate degree
  ) %>%
  mutate(
    FIPS = str_sub(GEO_ID, -5, -1),
    FIPS = str_pad(FIPS, 5, pad = "0")
  ) %>%
  select(FIPS, GEO_ID, NAME, everything())

# Preview
head(acs_b15003_clean)
## # A tibble: 6 × 8
##   FIPS  GEO_ID NAME  B15003_001E B15003_017E B15003_022E B15003_023E B15003_025E
##   <chr> <chr>  <chr> <chr>       <chr>       <chr>       <chr>       <chr>      
## 1 01001 05000… Auta… 37860       9663        6320        3399        498        
## 2 01003 05000… Bald… 155563      34521       31444       13331       2016       
## 3 01005 05000… Barb… 17797       5171        1296        540         113        
## 4 01007 05000… Bibb… 15987       5879        1183        481         41         
## 5 01009 05000… Blou… 39814       10582       3540        1382        132        
## 6 01011 05000… Bull… 6784        2498        413         152         28

22 8.3.2 Save cleaned B15003 dataset

write_csv(acs_b15003_clean, "processed_data/acs_b15003_2020_county_clean.csv")
saveRDS(acs_b15003_clean,  "processed_data/acs_b15003_2020_county_clean.rds")

# Confirm saved files
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"
library(dplyr)
library(readr)
library(stringr)

# 8.4.1 Load cleaned ACS datasets
acs_b28002 <- readRDS("processed_data/acs_b28002_2020_county_clean.rds")
acs_b19013 <- readRDS("processed_data/acs_b19013_2020_county_clean.rds")
acs_b15003 <- readRDS("processed_data/acs_b15003_2020_county_clean.rds")

# 8.4.2 Prepare B28002 (internet/computer)
acs_b28002_small <- acs_b28002 %>% 
  dplyr::rename(
    internet_total_hh   = B28002_001E,
    internet_broadband  = B28002_002E,
    internet_no_access  = B28002_012E,
    computer_no_device  = B28002_013E
  )

# 8.4.3 Prepare B19013 (income)
acs_b19013_small <- acs_b19013 %>% 
  dplyr::select(
    FIPS,
    income_median = B19013_001E
  )

# 8.4.4 Prepare B15003 (education)
acs_b15003_small <- acs_b15003 %>% 
  dplyr::select(
    FIPS,
    edu_total_25plus = B15003_001E,
    edu_hs           = B15003_017E,
    edu_bach         = B15003_022E,
    edu_mast         = B15003_023E,
    edu_doc          = B15003_025E
  )

# 8.4.5 Merge all three ACS datasets
acs_2020_all <- acs_b28002_small %>%
  dplyr::left_join(acs_b19013_small, by = "FIPS") %>%
  dplyr::left_join(acs_b15003_small, by = "FIPS")

glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS               <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID             <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME               <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh  <chr> "21559", "84047", "9322", "7259", "21205", "3429", …
## $ internet_broadband <chr> "17850", "71880", "6059", "5529", "16971", "2153", …
## $ internet_no_access <chr> "402", "2149", "563", "210", "314", "236", "150", "…
## $ computer_no_device <chr> "3307", "10018", "2700", "1520", "3920", "1040", "1…
## $ income_median      <chr> "57982", "61756", "34990", "51721", "48922", "33866…
## $ edu_total_25plus   <chr> "37860", "155563", "17797", "15987", "39814", "6784…
## $ edu_hs             <chr> "9663", "34521", "5171", "5879", "10582", "2498", "…
## $ edu_bach           <chr> "6320", "31444", "1296", "1183", "3540", "413", "13…
## $ edu_mast           <chr> "3399", "13331", "540", "481", "1382", "152", "651"…
## $ edu_doc            <chr> "498", "2016", "113", "41", "132", "28", "19", "883…
head(acs_2020_all)
## # A tibble: 6 × 13
##   FIPS  GEO_ID     NAME  internet_total_hh internet_broadband internet_no_access
##   <chr> <chr>      <chr> <chr>             <chr>              <chr>             
## 1 01001 0500000US… Auta… 21559             17850              402               
## 2 01003 0500000US… Bald… 84047             71880              2149              
## 3 01005 0500000US… Barb… 9322              6059               563               
## 4 01007 0500000US… Bibb… 7259              5529               210               
## 5 01009 0500000US… Blou… 21205             16971              314               
## 6 01011 0500000US… Bull… 3429              2153               236               
## # ℹ 7 more variables: computer_no_device <chr>, income_median <chr>,
## #   edu_total_25plus <chr>, edu_hs <chr>, edu_bach <chr>, edu_mast <chr>,
## #   edu_doc <chr>
# 8.4.x Ensure ACS numeric columns are numeric, not character

acs_2020_all <- acs_2020_all %>%
  mutate(
    across(
      c(
        internet_total_hh,
        internet_broadband,
        internet_no_access,
        computer_no_device,
        income_median,
        edu_total_25plus,
        edu_hs,
        edu_bach,
        edu_mast,
        edu_doc
      ),
      ~ as.numeric(.)
    )
  )

# Check types again
glimpse(acs_2020_all)
## Rows: 3,221
## Columns: 13
## $ FIPS               <chr> "01001", "01003", "01005", "01007", "01009", "01011…
## $ GEO_ID             <chr> "0500000US01001", "0500000US01003", "0500000US01005…
## $ NAME               <chr> "Autauga County, Alabama", "Baldwin County, Alabama…
## $ internet_total_hh  <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572,…
## $ internet_broadband <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, 35642,…
## $ internet_no_access <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284, 250, …
## $ computer_no_device <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 7999, 31…
## $ income_median      <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 44850, 50…
## $ edu_total_25plus   <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 13820, 78…
## $ edu_hs             <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 21801, …
## $ edu_bach           <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 8296, 249…
## $ edu_mast           <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, 740, 7…
## $ edu_doc            <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132, 106,…
# Save the final cleaned + numeric + merged ACS dataset

write_csv(acs_2020_all, "processed_data/acs_all_2020_county_clean.csv")
saveRDS(acs_2020_all,  "processed_data/acs_all_2020_county_clean.rds")

# Confirm saved
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"

23 9. Merge All Cleaned Datasets into One Master Dataset

Section 9 creates a single combined dataset by merging:

FCC broadband availability

SVI (overall + 4 themes)

Microsoft Airband broadband usage

ACS (internet, computer access, income, education)

24 9.1 Load All Cleaned Datasets

library(dplyr)
library(readr)

# Load FCC + SVI + Airband merged dataset
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")

# Load merged ACS dataset
acs_2020_all <- readRDS("processed_data/acs_all_2020_county_clean.rds")

# Quick preview
head(merged_fcc_svi_airband)
## # A tibble: 6 × 34
##   county_fips state_name county_name.x  housing_units tier1 tier2 tier3
##   <chr>       <chr>      <chr>                  <dbl> <int> <int> <int>
## 1 01001       AL         Autauga County         27544     4     4     4
## 2 01003       AL         Baldwin County        137072     4     2     2
## 3 01005       AL         Barbour County         15445     4     2     2
## 4 01007       AL         Bibb County            11168     4     1     1
## 5 01009       AL         Blount County          31265     4     1     1
## 6 01011       AL         Bullock County          6136     4     3     1
## # ℹ 27 more variables: coverage_25_3 <dbl>, coverage_100_20 <dbl>,
## #   coverage_250_25 <dbl>, coverage_1000_100 <dbl>, state.x <chr>,
## #   county <chr>, svi_overall <dbl>, svi_soc <dbl>, svi_hh <dbl>,
## #   svi_min <dbl>, svi_hous <dbl>, state.y <chr>, county_name.y <chr>,
## #   airband_fcc_availability <chr>, airband_usage <chr>,
## #   avg_download_mbps <dbl>, avg_upload_mbps <dbl>, avg_latency_ms <dbl>,
## #   median_download_mbps <dbl>, median_upload_mbps <dbl>, total_tests <int>, …
head(acs_2020_all)
## # A tibble: 6 × 13
##   FIPS  GEO_ID     NAME  internet_total_hh internet_broadband internet_no_access
##   <chr> <chr>      <chr>             <dbl>              <dbl>              <dbl>
## 1 01001 0500000US… Auta…             21559              17850                402
## 2 01003 0500000US… Bald…             84047              71880               2149
## 3 01005 0500000US… Barb…              9322               6059                563
## 4 01007 0500000US… Bibb…              7259               5529                210
## 5 01009 0500000US… Blou…             21205              16971                314
## 6 01011 0500000US… Bull…              3429               2153                236
## # ℹ 7 more variables: computer_no_device <dbl>, income_median <dbl>,
## #   edu_total_25plus <dbl>, edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>,
## #   edu_doc <dbl>
merged_fcc_svi_airband <- readRDS("processed_data/merged_fcc_svi_airband_2020.rds")

25 9.2 Merge Core Dataset with ACS

Merging is simple: Left join so that FIPS from FCC/SVI/Airband stays the base.

# 9.2 Create Master 2020 County Dataset

# 9.2.1 Ensure common key name: county_fips -> FIPS
merged_fcc_svi_airband <- merged_fcc_svi_airband %>%
  dplyr::rename(FIPS = county_fips)

# 9.2.2 Merge FCC + SVI + Airband with ACS (by FIPS)
master_2020 <- merged_fcc_svi_airband %>%
  dplyr::left_join(acs_2020_all, by = "FIPS")

# 9.2.3 Convert Airband columns to numeric (if they are stored as character)
master_2020 <- master_2020 %>%
  dplyr::mutate(
    airband_fcc_availability = as.numeric(airband_fcc_availability),
    airband_usage            = as.numeric(airband_usage)
  )

# 9.2.4 Clean up duplicate state/county columns and keep a tidy set
master_2020 <- master_2020 %>%
  dplyr::select(
    FIPS,
    state_name,
    county_name = county_name.y,   # final county name
    housing_units,
    tier1, tier2, tier3,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    airband_fcc_availability,
    airband_usage,
    GEO_ID, NAME,
    internet_total_hh,
    internet_broadband,
    internet_no_access,
    computer_no_device,
    income_median,
    edu_total_25plus,
    edu_hs,
    edu_bach,
    edu_mast,
    edu_doc
  )

# 9.2.5 Quick structural check
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS                     <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units            <dbl> 27544, 137072, 15445, 11168, 31265, 6136, 128…
## $ tier1                    <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ tier2                    <int> 4, 2, 2, 1, 1, 3, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ tier3                    <int> 4, 2, 2, 1, 1, 1, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ svi_overall              <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc                  <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh                   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min                  <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous                 <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage            <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID                   <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME                     <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh        <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband       <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access       <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device       <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median            <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus         <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs                   <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach                 <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast                 <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc                  <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…
head(master_2020)
## # A tibble: 6 × 26
##   FIPS  state_name county_name    housing_units tier1 tier2 tier3 svi_overall
##   <chr> <chr>      <chr>                  <dbl> <int> <int> <int>       <dbl>
## 1 01001 AL         Autauga County         27544     4     4     4       0.513
## 2 01003 AL         Baldwin County        137072     4     2     2       0.310
## 3 01005 AL         Barbour County         15445     4     2     2       0.993
## 4 01007 AL         Bibb County            11168     4     1     1       0.808
## 5 01009 AL         Blount County          31265     4     1     1       0.514
## 6 01011 AL         Bullock County          6136     4     3     1       0.831
## # ℹ 18 more variables: svi_soc <dbl>, svi_hh <dbl>, svi_min <dbl>,
## #   svi_hous <dbl>, airband_fcc_availability <dbl>, airband_usage <dbl>,
## #   GEO_ID <chr>, NAME <chr>, internet_total_hh <dbl>,
## #   internet_broadband <dbl>, internet_no_access <dbl>,
## #   computer_no_device <dbl>, income_median <dbl>, edu_total_25plus <dbl>,
## #   edu_hs <dbl>, edu_bach <dbl>, edu_mast <dbl>, edu_doc <dbl>
# 9.3 Save Final Master 2020 County Dataset

write_csv(master_2020, "processed_data/master_2020_county.csv")
saveRDS(master_2020,  "processed_data/master_2020_county.rds")

# Check that it's there
list.files("processed_data")
##  [1] "acs_all_2020_county_clean.csv"        
##  [2] "acs_all_2020_county_clean.rds"        
##  [3] "acs_b15003_2020_county_clean.csv"     
##  [4] "acs_b15003_2020_county_clean.rds"     
##  [5] "acs_b19013_2020_county_clean.csv"     
##  [6] "acs_b19013_2020_county_clean.rds"     
##  [7] "acs_b28002_2020_county_clean.csv"     
##  [8] "acs_b28002_2020_county_clean.rds"     
##  [9] "airband_2020_county_clean.csv"        
## [10] "airband_2020_county_clean.rds"        
## [11] "airband_inconsistency_counties.csv"   
## [12] "analysis_2020_county_final_sf.rds"    
## [13] "analysis_2020_county_final.csv"       
## [14] "analysis_2020_county_final.rds"       
## [15] "analysis_2020_county_sf.rds"          
## [16] "analysis_2020_county.csv"             
## [17] "analysis_2020_county.rds"             
## [18] "counties_sf.rds"                      
## [19] "fcc_2020_dec_clean.csv"               
## [20] "fcc_2020_dec_clean.rds"               
## [21] "fcc_2025_may_clean.csv"               
## [22] "fcc_2025_may_clean.rds"               
## [23] "master_2020_county_sf.rds"            
## [24] "master_2020_county.csv"               
## [25] "master_2020_county.rds"               
## [26] "merged_fcc_svi_2020.csv"              
## [27] "merged_fcc_svi_2020.rds"              
## [28] "merged_fcc_svi_airband_2020.csv"      
## [29] "merged_fcc_svi_airband_2020.rds"      
## [30] "merged_fcc_svi_airband_ookla_2020.csv"
## [31] "merged_fcc_svi_airband_ookla_2020.rds"
## [32] "merged_master_2020.csv"               
## [33] "merged_master_2020.rds"               
## [34] "ookla_2020_county_clean.csv"          
## [35] "ookla_2020_county_clean.rds"          
## [36] "svi_2020_county_clean.csv"            
## [37] "svi_2020_county_clean.rds"            
## [38] "usda_rucc_2023_clean.csv"             
## [39] "usda_rucc_2023_clean.rds"

26 10. Create Spatial Master Dataset (Attach County Geometry)

The goal of this section is to combine:

counties_sf.rds → county boundaries (shapefile processed in Section 7)

master_2020_county.rds → all your attributes (FCC, SVI, Airband, ACS)

into a single sf object:

master_2020_sf → one row per county, with both attributes and geometry.

We’ll join them using:

GEOID from the shapefile

FIPS from master_2020

library(sf)
library(dplyr)

# 10.1 Load processed county shapefile (sf object)
counties_sf <- readRDS("processed_data/counties_sf.rds")

# 10.1 Load non-spatial master dataset
master_2020 <- readRDS("processed_data/master_2020_county.rds")

# Quick checks
counties_sf
## Simple feature collection with 3234 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID        NAME STATEFP COUNTYFP                       geometry
## 1  31039      Cuming      31      039 MULTIPOLYGON (((-97.01952 4...
## 2  53069   Wahkiakum      53      069 MULTIPOLYGON (((-123.4364 4...
## 3  35011     De Baca      35      011 MULTIPOLYGON (((-104.5674 3...
## 4  31109   Lancaster      31      109 MULTIPOLYGON (((-96.91075 4...
## 5  31129    Nuckolls      31      129 MULTIPOLYGON (((-98.27367 4...
## 6  72085 Las Piedras      72      085 MULTIPOLYGON (((-65.91048 1...
## 7  46099   Minnehaha      46      099 MULTIPOLYGON (((-96.89022 4...
## 8  48327      Menard      48      327 MULTIPOLYGON (((-99.82187 3...
## 9  06091      Sierra      06      091 MULTIPOLYGON (((-120.6556 3...
## 10 21053     Clinton      21      053 MULTIPOLYGON (((-85.2391 36...
glimpse(master_2020)
## Rows: 3,141
## Columns: 26
## $ FIPS                     <chr> "01001", "01003", "01005", "01007", "01009", …
## $ state_name               <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL…
## $ county_name              <chr> "Autauga County", "Baldwin County", "Barbour …
## $ housing_units            <dbl> 27544, 137072, 15445, 11168, 31265, 6136, 128…
## $ tier1                    <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ tier2                    <int> 4, 2, 2, 1, 1, 3, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ tier3                    <int> 4, 2, 2, 1, 1, 1, 2, 4, 3, 1, 1, 1, 1, 1, 1, …
## $ svi_overall              <dbl> 0.5130, 0.3103, 0.9927, 0.8078, 0.5137, 0.831…
## $ svi_soc                  <dbl> 0.3838, 0.3253, 0.9567, 0.8008, 0.5757, 0.718…
## $ svi_hh                   <dbl> 0.7362, 0.2724, 0.9453, 0.5512, 0.7225, 0.651…
## $ svi_min                  <dbl> 0.6337, 0.5022, 0.8962, 0.6292, 0.4147, 0.980…
## $ svi_hous                 <dbl> 0.4309, 0.3612, 0.9949, 0.8622, 0.2743, 0.850…
## $ airband_fcc_availability <dbl> 0.8057, 0.8362, 0.6891, 0.3368, 0.7580, 0.936…
## $ airband_usage            <dbl> 0.391, 0.452, 0.324, 0.136, 0.199, 0.157, 0.1…
## $ GEO_ID                   <chr> "0500000US01001", "0500000US01003", "0500000U…
## $ NAME                     <chr> "Autauga County, Alabama", "Baldwin County, A…
## $ internet_total_hh        <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, …
## $ internet_broadband       <dbl> 17850, 71880, 6059, 5529, 16971, 2153, 4941, …
## $ internet_no_access       <dbl> 402, 2149, 563, 210, 314, 236, 150, 931, 284,…
## $ computer_no_device       <dbl> 3307, 10018, 2700, 1520, 3920, 1040, 1558, 79…
## $ income_median            <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 448…
## $ edu_total_25plus         <dbl> 37860, 155563, 17797, 15987, 39814, 6784, 138…
## $ edu_hs                   <dbl> 9663, 34521, 5171, 5879, 10582, 2498, 5583, 2…
## $ edu_bach                 <dbl> 6320, 31444, 1296, 1183, 3540, 413, 1396, 829…
## $ edu_mast                 <dbl> 3399, 13331, 540, 481, 1382, 152, 651, 4618, …
## $ edu_doc                  <dbl> 498, 2016, 113, 41, 132, 28, 19, 883, 29, 132…

26.1 10.2 Spatial Join: Attach Attributes to Geometry

To keep the geometry, we start from counties_sf and left_join() the master_2020 table into it.

# 10.2 Join master data to county polygons
# GEOID (shapefile) matches FIPS (master table)

master_2020_sf <- counties_sf %>% 
  left_join(master_2020, by = c("GEOID" = "FIPS"))

# Check result: should be an sf object with all master_2020 columns + geometry
master_2020_sf
## Simple feature collection with 3234 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS:  WGS 84
## First 10 features:
##    GEOID      NAME.x STATEFP COUNTYFP state_name      county_name housing_units
## 1  31039      Cuming      31      039         NE    Cuming County          5494
## 2  53069   Wahkiakum      53      069         WA Wahkiakum County          2790
## 3  35011     De Baca      35      011         NM   De Baca County          1531
## 4  31109   Lancaster      31      109         NE Lancaster County        148105
## 5  31129    Nuckolls      31      129         NE  Nuckolls County          3396
## 6  72085 Las Piedras      72      085       <NA>             <NA>            NA
## 7  46099   Minnehaha      46      099         SD Minnehaha County         96413
## 8  48327      Menard      48      327         TX    Menard County          2689
## 9  06091      Sierra      06      091         CA    Sierra County          2581
## 10 21053     Clinton      21      053         KY   Clinton County          6151
##    tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1      4     4     3      0.1610  0.1101 0.5738  0.3762   0.1439
## 2      4     2     2      0.1706  0.2314 0.2167  0.4411   0.1467
## 3      4     4     3      0.5891  0.7782 0.2447  0.9621   0.2775
## 4      4     4     4      0.3781  0.2629 0.2412  0.5407   0.6785
## 5      4     4     2      0.0707  0.0904 0.2371  0.1359   0.1248
## 6     NA    NA    NA          NA      NA     NA      NA       NA
## 7      4     4     4      0.4013  0.1677 0.5484  0.5242   0.7072
## 8      4     3     1      0.7702  0.6680 0.8211  0.7855   0.6292
## 9      4     1     1      0.2810  0.5325 0.3501  0.3682   0.0792
## 10     4     2     2      0.7600  0.8291 0.7358  0.0907   0.7728
##    airband_fcc_availability airband_usage         GEO_ID
## 1                    0.8900         0.284 0500000US31039
## 2                    0.6246         0.273 0500000US53069
## 3                    0.8124         0.455 0500000US35011
## 4                    1.0000         0.677 0500000US31109
## 5                    0.7406         0.429 0500000US31129
## 6                        NA            NA           <NA>
## 7                    0.9855         0.692 0500000US46099
## 8                    0.9097         0.050 0500000US48327
## 9                    0.6599         0.077 0500000US06091
## 10                   0.9208         0.129 0500000US21053
##                            NAME.y internet_total_hh internet_broadband
## 1         Cuming County, Nebraska              3854               2817
## 2    Wahkiakum County, Washington              1900               1551
## 3      De Baca County, New Mexico               554                381
## 4      Lancaster County, Nebraska            126666             113669
## 5       Nuckolls County, Nebraska              1852               1415
## 6                            <NA>                NA                 NA
## 7  Minnehaha County, South Dakota             78453              69385
## 8            Menard County, Texas              1035                662
## 9       Sierra County, California              1250                946
## 10       Clinton County, Kentucky              4023               2472
##    internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                 208                829         59202             6071   2103
## 2                  83                266         54524             3356    719
## 3                  12                161         31532              941    248
## 4                3450               9547         62464           196706  34338
## 5                  37                400         52975             3060    857
## 6                  NA                 NA            NA               NA     NA
## 7                2515               6553         63699           126191  28726
## 8                  95                278         43826             1670    335
## 9                  74                230         52103             2302    559
## 10                 61               1490         33092             7209   2450
##    edu_bach edu_mast edu_doc                       geometry
## 1      1026      259      15 MULTIPOLYGON (((-97.01952 4...
## 2       389      253      20 MULTIPOLYGON (((-123.4364 4...
## 3        63       62       0 MULTIPOLYGON (((-104.5674 3...
## 4     48821    19426    5242 MULTIPOLYGON (((-96.91075 4...
## 5       499      153      48 MULTIPOLYGON (((-98.27367 4...
## 6        NA       NA      NA MULTIPOLYGON (((-65.91048 1...
## 7     29293     9810    1293 MULTIPOLYGON (((-96.89022 4...
## 8       253      114      33 MULTIPOLYGON (((-99.82187 3...
## 9       202      164       9 MULTIPOLYGON (((-120.6556 3...
## 10      494      255      80 MULTIPOLYGON (((-85.2391 36...
sf::st_crs(master_2020_sf)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID                    <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x                   <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP                  <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP                 <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name               <chr> "NE", "WA", "NM", "NE", "NE", NA, "SD", "TX",…
## $ county_name              <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units            <dbl> 5494, 2790, 1531, 148105, 3396, NA, 96413, 26…
## $ tier1                    <int> 4, 4, 4, 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tier2                    <int> 4, 2, 4, 4, 4, NA, 4, 3, 1, 2, 4, 4, 1, 4, 4,…
## $ tier3                    <int> 3, 2, 3, 4, 2, NA, 4, 1, 1, 2, 4, 3, 1, 1, 3,…
## $ svi_overall              <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc                  <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh                   <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min                  <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous                 <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage            <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID                   <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y                   <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh        <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband       <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access       <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device       <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median            <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus         <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs                   <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach                 <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast                 <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc                  <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry                 <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …

27 10.3 Save the Spatial Dataset

# At this point, you should have created this object:
# master_2020_sf <- left_join(counties_sf, master_2020, by = "GEOID")

# Check structure
glimpse(master_2020_sf)
## Rows: 3,234
## Columns: 30
## $ GEOID                    <chr> "31039", "53069", "35011", "31109", "31129", …
## $ NAME.x                   <chr> "Cuming", "Wahkiakum", "De Baca", "Lancaster"…
## $ STATEFP                  <chr> "31", "53", "35", "31", "31", "72", "46", "48…
## $ COUNTYFP                 <chr> "039", "069", "011", "109", "129", "085", "09…
## $ state_name               <chr> "NE", "WA", "NM", "NE", "NE", NA, "SD", "TX",…
## $ county_name              <chr> "Cuming County", "Wahkiakum County", "De Baca…
## $ housing_units            <dbl> 5494, 2790, 1531, 148105, 3396, NA, 96413, 26…
## $ tier1                    <int> 4, 4, 4, 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tier2                    <int> 4, 2, 4, 4, 4, NA, 4, 3, 1, 2, 4, 4, 1, 4, 4,…
## $ tier3                    <int> 3, 2, 3, 4, 2, NA, 4, 1, 1, 2, 4, 3, 1, 1, 3,…
## $ svi_overall              <dbl> 0.1610, 0.1706, 0.5891, 0.3781, 0.0707, NA, 0…
## $ svi_soc                  <dbl> 0.1101, 0.2314, 0.7782, 0.2629, 0.0904, NA, 0…
## $ svi_hh                   <dbl> 0.5738, 0.2167, 0.2447, 0.2412, 0.2371, NA, 0…
## $ svi_min                  <dbl> 0.3762, 0.4411, 0.9621, 0.5407, 0.1359, NA, 0…
## $ svi_hous                 <dbl> 0.1439, 0.1467, 0.2775, 0.6785, 0.1248, NA, 0…
## $ airband_fcc_availability <dbl> 0.8900, 0.6246, 0.8124, 1.0000, 0.7406, NA, 0…
## $ airband_usage            <dbl> 0.284, 0.273, 0.455, 0.677, 0.429, NA, 0.692,…
## $ GEO_ID                   <chr> "0500000US31039", "0500000US53069", "0500000U…
## $ NAME.y                   <chr> "Cuming County, Nebraska", "Wahkiakum County,…
## $ internet_total_hh        <dbl> 3854, 1900, 554, 126666, 1852, NA, 78453, 103…
## $ internet_broadband       <dbl> 2817, 1551, 381, 113669, 1415, NA, 69385, 662…
## $ internet_no_access       <dbl> 208, 83, 12, 3450, 37, NA, 2515, 95, 74, 61, …
## $ computer_no_device       <dbl> 829, 266, 161, 9547, 400, NA, 6553, 278, 230,…
## $ income_median            <dbl> 59202, 54524, 31532, 62464, 52975, NA, 63699,…
## $ edu_total_25plus         <dbl> 6071, 3356, 941, 196706, 3060, NA, 126191, 16…
## $ edu_hs                   <dbl> 2103, 719, 248, 34338, 857, NA, 28726, 335, 5…
## $ edu_bach                 <dbl> 1026, 389, 63, 48821, 499, NA, 29293, 253, 20…
## $ edu_mast                 <dbl> 259, 253, 62, 19426, 153, NA, 9810, 114, 164,…
## $ edu_doc                  <dbl> 15, 20, 0, 5242, 48, NA, 1293, 33, 9, 80, 473…
## $ geometry                 <MULTIPOLYGON [°]> MULTIPOLYGON (((-97.01952 4..., …
# Save spatial master dataset
saveRDS(master_2020_sf, "processed_data/master_2020_county_sf.rds")

# Confirm saved file
list.files("processed_data", pattern = "sf")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_sf.rds"      
## [3] "counties_sf.rds"                   "master_2020_county_sf.rds"

28 11. Filtering and NA removal

The final analysis dataset was created by filtering the spatial master table to retain only counties with complete data across key broadband and vulnerability indicators, including internet households, median income, Airband availability and usage, and SVI overall score. This reduced the dataset from 3,234 county-equivalents in the TIGER/Line shapefile to 3,120 valid U.S. counties with fully complete records. These 3,120 counties form the final, clean analytical base for clustering, spatial joins, and digital divide analysis.

library(dplyr)
library(sf)

# 1) Load the spatial master dataset from Section 10
master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")

# 2) Keep only counties with complete data in key fields
analysis_2020_sf <- master_2020_sf %>%
  filter(
    !is.na(internet_total_hh),
    !is.na(income_median),
    !is.na(airband_fcc_availability),
    !is.na(airband_usage),
    !is.na(svi_overall)
  )

# How many counties remain?
nrow(master_2020_sf)      # before
## [1] 3234
nrow(analysis_2020_sf)    # after
## [1] 3120
# 3) Create a non-spatial version (for clustering, regressions, etc.)
analysis_2020 <- analysis_2020_sf %>%
  st_drop_geometry()

# 4) Save final analysis datasets
saveRDS(analysis_2020_sf, "processed_data/analysis_2020_county_sf.rds")
saveRDS(analysis_2020,    "processed_data/analysis_2020_county.rds")
readr::write_csv(analysis_2020, "processed_data/analysis_2020_county.csv")

# 5) Confirm files
list.files("processed_data", pattern = "analysis_2020")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"   
## [3] "analysis_2020_county_final.rds"    "analysis_2020_county_sf.rds"      
## [5] "analysis_2020_county.csv"          "analysis_2020_county.rds"

29 11.1 Consistency Checks

Several internal validation checks were performed to ensure coherence across the merged dataset. Broadband subscription counts were verified to never exceed total household counts, and all ACS-derived broadband and device-access percentages fell within the valid range of 0–100%. A correlation analysis between median household income and the SVI socioeconomic theme produced the expected negative relationship, confirming consistency between ACS demographics and CDC social vulnerability metrics. Some counties displayed Airband usage values greater than modeled availability—this is a known feature of Airband’s methodology and does not represent data errors. Overall, no logical inconsistencies were detected in the final analysis dataset.

library(dplyr)

# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county.rds")


# 1. Logical Broadband Checks

# Broadband subscriptions cannot exceed households
invalid_broadband_logic <- analysis_2020 %>%
  filter(
    internet_broadband > internet_total_hh |
    internet_no_access > internet_total_hh
  )

invalid_broadband_logic   # ideally 0 rows
##  [1] GEOID                    NAME.x                   STATEFP                 
##  [4] COUNTYFP                 state_name               county_name             
##  [7] housing_units            tier1                    tier2                   
## [10] tier3                    svi_overall              svi_soc                 
## [13] svi_hh                   svi_min                  svi_hous                
## [16] airband_fcc_availability airband_usage            GEO_ID                  
## [19] NAME.y                   internet_total_hh        internet_broadband      
## [22] internet_no_access       computer_no_device       income_median           
## [25] edu_total_25plus         edu_hs                   edu_bach                
## [28] edu_mast                 edu_doc                 
## <0 rows> (or 0-length row.names)
# 2. ACS Percentage Validity

acs_pct_check <- analysis_2020 %>%
  mutate(
    pct_broadband   = internet_broadband / internet_total_hh * 100,
    pct_no_access   = internet_no_access / internet_total_hh * 100,
    pct_no_device   = computer_no_device / internet_total_hh * 100
  )

invalid_pct <- acs_pct_check %>%
  filter(
    pct_broadband < 0 | pct_broadband > 100 |
    pct_no_access < 0 | pct_no_access > 100 |
    pct_no_device < 0 | pct_no_device > 100
  )

invalid_pct    # ideally 0 rows
##  [1] GEOID                    NAME.x                   STATEFP                 
##  [4] COUNTYFP                 state_name               county_name             
##  [7] housing_units            tier1                    tier2                   
## [10] tier3                    svi_overall              svi_soc                 
## [13] svi_hh                   svi_min                  svi_hous                
## [16] airband_fcc_availability airband_usage            GEO_ID                  
## [19] NAME.y                   internet_total_hh        internet_broadband      
## [22] internet_no_access       computer_no_device       income_median           
## [25] edu_total_25plus         edu_hs                   edu_bach                
## [28] edu_mast                 edu_doc                  pct_broadband           
## [31] pct_no_access            pct_no_device           
## <0 rows> (or 0-length row.names)
# 3. Income vs SVI Socioeconomic Relationship

cor_income_svi <- cor(
  analysis_2020$income_median,
  analysis_2020$svi_soc,
  use = "complete.obs"
)

cor_income_svi
## [1] -0.6171082
# 4. Airband Logical Check:

airband_inconsistency <- analysis_2020 %>%
  filter(
    airband_usage > airband_fcc_availability
  )

nrow(airband_inconsistency)
## [1] 50
head(airband_inconsistency)
##   GEOID           NAME.x STATEFP COUNTYFP state_name              county_name
## 1 22107           Tensas      22      107         LA            Tensas Parish
## 2 02188 Northwest Arctic      02      188         AK Northwest Arctic Borough
## 3 01105            Perry      01      105         AL             Perry County
## 4 20151            Pratt      20      151         KS             Pratt County
## 5 01063           Greene      01      063         AL            Greene County
## 6 28055        Issaquena      28      055         MS         Issaquena County
##   housing_units tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1          3734     4     1     1      0.7234  0.9093 0.1735  0.9144   0.4930
## 2          3329     4     4     1      0.9153  0.8915 0.4376  0.9933   0.9749
## 3          5751     4     1     1      0.9137  0.9803 0.5732  0.9653   0.7314
## 4          5644     4     4     3      0.3877  0.2002 0.7180  0.3819   0.5156
## 5          5979     4     1     1      0.9513  0.9885 0.5191  0.9850   0.8908
## 6          1487     4     1     1      0.9847  0.9634 0.4379  0.9297   1.0000
##   airband_fcc_availability airband_usage         GEO_ID
## 1                   0.0305         0.054 0500000US22107
## 2                   0.0081         0.039 0500000US02188
## 3                   0.0012         0.032 0500000US01105
## 4                   0.7943         0.884 0500000US20151
## 5                   0.0085         0.013 0500000US01063
## 6                   0.0173         0.145 0500000US28055
##                             NAME.y internet_total_hh internet_broadband
## 1         Tensas Parish, Louisiana              1814                998
## 2 Northwest Arctic Borough, Alaska              1795               1407
## 3            Perry County, Alabama              3140               1738
## 4             Pratt County, Kansas              3733               3000
## 5           Greene County, Alabama              3178               1768
## 6    Issaquena County, Mississippi               462                154
##   internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                 86                730         29767             3135   1172
## 2                 21                367         63750             4251   1663
## 3                 53               1349         23875             5723   1882
## 4                158                575         54644             6042   1240
## 5                152               1258         26688             5768   2131
## 6                 68                240         28333             1001    280
##   edu_bach edu_mast edu_doc
## 1      469       91      19
## 2      354      213      14
## 3      661      272      71
## 4     1236      378      21
## 5      332      190      46
## 6        9       20       1

Fifty counties exhibited cases where broadband usage exceeded FCC-reported broadband availability. This is an expected pattern in Airband datasets due to differences in modeling assumptions: FCC availability represents provider-reported infrastructure, while Airband usage measures active connectivity across satellite, mobile, and other modalities. These counties were retained, as the differences reflect true measurement variation rather than data quality issues. ## 11.1.1 Table of the fifty counties

# Build a clean table for reporting
airband_inconsistency_table <- airband_inconsistency %>%
  dplyr::select(
    GEOID,
    state_name,
    county_name,
    airband_fcc_availability,
    airband_usage
  ) %>%
  arrange(state_name, county_name)

# Display preview
head(airband_inconsistency_table)
##   GEOID state_name                       county_name airband_fcc_availability
## 1 02185         AK               North Slope Borough                   0.0061
## 2 02188         AK          Northwest Arctic Borough                   0.0081
## 3 02198         AK Prince of Wales-Hyder Census Area                   0.0282
## 4 02290         AK         Yukon-Koyukuk Census Area                   0.0130
## 5 01063         AL                     Greene County                   0.0085
## 6 01105         AL                      Perry County                   0.0012
##   airband_usage
## 1         0.051
## 2         0.039
## 3         0.040
## 4         0.139
## 5         0.013
## 6         0.032
# Print full table
airband_inconsistency_table
##    GEOID state_name                       county_name airband_fcc_availability
## 1  02185         AK               North Slope Borough                   0.0061
## 2  02188         AK          Northwest Arctic Borough                   0.0081
## 3  02198         AK Prince of Wales-Hyder Census Area                   0.0282
## 4  02290         AK         Yukon-Koyukuk Census Area                   0.0130
## 5  01063         AL                     Greene County                   0.0085
## 6  01105         AL                      Perry County                   0.0012
## 7  05013         AR                    Calhoun County                   0.0852
## 8  05095         AR                     Monroe County                   0.0557
## 9  12029         FL                      Dixie County                   0.0078
## 10 13053         GA              Chattahoochee County                   0.5456
## 11 13101         GA                     Echols County                   0.0002
## 12 13125         GA                   Glascock County                   0.0101
## 13 13201         GA                     Miller County                   0.1072
## 14 13259         GA                    Stewart County                   0.0082
## 15 13301         GA                     Warren County                   0.0019
## 16 15005         HI                    Kalawao County                   0.1977
## 17 16069         ID                  Nez Perce County                   0.8671
## 18 17003         IL                  Alexander County                   0.0009
## 19 20119         KS                      Meade County                   0.0010
## 20 20151         KS                      Pratt County                   0.7943
## 21 22023         LA                    Cameron Parish                   0.0184
## 22 22107         LA                     Tensas Parish                   0.0305
## 23 26095         MI                       Luce County                   0.0393
## 24 29017         MO                  Bollinger County                   0.0161
## 25 29089         MO                     Howard County                   0.1625
## 26 29153         MO                      Ozark County                   0.0375
## 27 29157         MO                      Perry County                   0.5334
## 28 28031         MS                  Covington County                   0.0451
## 29 28055         MS                  Issaquena County                   0.0173
## 30 28063         MS                  Jefferson County                   0.0309
## 31 30033         MT                   Garfield County                   0.4285
## 32 30055         MT                     McCone County                   0.4946
## 33 30109         MT                     Wibaux County                   0.0826
## 34 38053         ND                   McKenzie County                   0.7279
## 35 35053         NM                    Socorro County                   0.0453
## 36 32015         NV                     Lander County                   0.0022
## 37 32033         NV                 White Pine County                   0.0095
## 38 41021         OR                    Gilliam County                   0.3797
## 39 45065         SC                  McCormick County                   0.4012
## 40 48211         TX                   Hemphill County                   0.0937
## 41 48271         TX                     Kinney County                   0.0221
## 42 48301         TX                     Loving County                   0.3018
## 43 49019         UT                      Grand County                   0.5325
## 44 51570         VA             Colonial Heights city                   0.9872
## 45 51610         VA                 Falls Church city                   0.9860
## 46 51081         VA                Greensville County                   0.2399
## 47 51683         VA                     Manassas city                   0.9611
## 48 51820         VA                   Waynesboro city                   0.9838
## 49 51840         VA                   Winchester city                   0.9978
## 50 54061         WV                 Monongalia County                   0.5361
##    airband_usage
## 1          0.051
## 2          0.039
## 3          0.040
## 4          0.139
## 5          0.013
## 6          0.032
## 7          0.202
## 8          0.058
## 9          0.041
## 10         0.755
## 11         0.034
## 12         0.033
## 13         0.187
## 14         0.047
## 15         0.049
## 16         0.221
## 17         0.870
## 18         0.104
## 19         0.141
## 20         0.884
## 21         0.134
## 22         0.054
## 23         0.093
## 24         0.066
## 25         0.183
## 26         0.048
## 27         0.600
## 28         0.054
## 29         0.145
## 30         0.053
## 31         0.779
## 32         0.514
## 33         0.125
## 34         0.878
## 35         0.211
## 36         0.056
## 37         0.083
## 38         0.409
## 39         0.439
## 40         0.219
## 41         0.091
## 42         0.941
## 43         0.813
## 44         1.000
## 45         1.000
## 46         0.377
## 47         1.000
## 48         1.000
## 49         1.000
## 50         0.562
# Save CSV for documentation
readr::write_csv(
  airband_inconsistency_table,
  "processed_data/airband_inconsistency_counties.csv"
)

30 11.2 Outlier detection

30.1 Outlier Detection and Geographic Validation

In this section, we detect outliers using two statistical approaches: 1. IQR Method – flags extreme values beyond the 1.5×IQR rule
2. Z-Scores – flags values where |z| > 3 relative to the mean

We apply both methods to all continuous variables and then examine the geographic distribution of outlier counties.


## -------------------------------
## 11.1 Identify continuous fields
## -------------------------------
continuous_vars <- c(
  "internet_total_hh", "internet_broadband", "internet_no_access",
  "computer_no_device", "income_median",
  "edu_total_25plus", "edu_hs", "edu_bach", "edu_mast", "edu_doc",
  "airband_fcc_availability", "airband_usage"
)

num_data <- analysis_2020 %>% dplyr::select(GEOID, county_name, state_name, all_of(continuous_vars))


## -------------------------------
## 11.2 IQR Outlier Detection
## -------------------------------
iqr_flag <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  (x < (Q1 - 1.5 * IQR)) | (x > (Q3 + 1.5 * IQR))
}

iqr_outliers <- num_data %>%
  mutate(across(all_of(continuous_vars), iqr_flag))

iqr_summary <- data.frame(
  variable = continuous_vars,
  n_outliers = sapply(iqr_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)

iqr_summary
##                                          variable n_outliers
## internet_total_hh               internet_total_hh        418
## internet_broadband             internet_broadband        434
## internet_no_access             internet_no_access        393
## computer_no_device             computer_no_device        323
## income_median                       income_median        126
## edu_total_25plus                 edu_total_25plus        429
## edu_hs                                     edu_hs        379
## edu_bach                                 edu_bach        480
## edu_mast                                 edu_mast        466
## edu_doc                                   edu_doc        480
## airband_fcc_availability airband_fcc_availability        206
## airband_usage                       airband_usage          0
## -------------------------------
## 11.3 Z-Score Outlier Detection
## -------------------------------
z_df <- num_data %>%
  mutate(across(all_of(continuous_vars),
                ~ (.-mean(., na.rm=TRUE)) / sd(., na.rm=TRUE)))

z_outliers <- z_df %>%
  mutate(across(all_of(continuous_vars), ~ abs(.) > 3))

z_summary <- data.frame(
  variable = continuous_vars,
  n_outliers = sapply(z_outliers[continuous_vars], function(x) sum(x, na.rm=TRUE))
)

z_summary
##                                          variable n_outliers
## internet_total_hh               internet_total_hh         44
## internet_broadband             internet_broadband         46
## internet_no_access             internet_no_access         41
## computer_no_device             computer_no_device         37
## income_median                       income_median         52
## edu_total_25plus                 edu_total_25plus         38
## edu_hs                                     edu_hs         39
## edu_bach                                 edu_bach         47
## edu_mast                                 edu_mast         54
## edu_doc                                   edu_doc         47
## airband_fcc_availability airband_fcc_availability         72
## airband_usage                       airband_usage          0
## 11.4 Compile Outlier Counties (Geographic Context)


# A county is an outlier if flagged by either method
combined_flags <- (iqr_outliers[continuous_vars] | z_outliers[continuous_vars])

# Extract counties flagged at least once
outlier_counties <- num_data %>%
  mutate(any_outlier = apply(combined_flags, 1, any)) %>%
  filter(any_outlier == TRUE) %>%
  dplyr::select(GEOID, county_name, state_name)

# Show first few
head(outlier_counties)
##   GEOID      county_name state_name
## 1 31109 Lancaster County         NE
## 2 46099 Minnehaha County         SD
## 3 01027      Clay County         AL
## 4 05137     Stone County         AR
## 5 42007    Beaver County         PA
## 6 37037   Chatham County         NC

Interpretation Summary
Large-population counties (e.g., Los Angeles, Cook, Harris) appear as outliers due to extremely high total households, broadband counts, and education levels.
- Very small counties (e.g., Loving TX, Kalawao HI, rural Alaska census areas) produce outliers in the opposite direction because of very small denominators.
- Broadband usage outliers cluster in: - Tribal regions (low broadband adoption)
- Appalachia(low income & limited infrastructure)
- Major metros(very high availability and usage)
- Income outliers highlight: - Rich suburban counties in Virginia, Maryland, California - Poor rural counties in Mississippi, Alabama, and Arkansas

30.2 12. Feature Engineering & Final Variable Preparation

Feature engineering creates new variables that better capture digital divide conditions across U.S. counties. These engineered variables will be used in later analysis, clustering, and visualization.

30.2.1 12.1 Derived Variables (Digital Divide Metrics)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # Broadband gap: difference between "available" vs "actually used"
    broadband_gap = airband_fcc_availability - airband_usage,

    # Affordability index: inverse of income (lower income = higher affordability risk)
    affordability_index = 1 / income_median,

    # Technology mix placeholder (FCC county data does not include tech details)
    technology_mix = NA,

    # Composite score combining SVI + broadband usage deficit
    digital_vulnerability_score =
      (svi_overall * 0.5) +
      ((1 - airband_usage) * 0.5),

    # Priority intervention: worst quartile in BOTH vulnerability & broadband_gap
    priority_intervention_flag =
      (ntile(digital_vulnerability_score, 4) == 4 &
       ntile(broadband_gap, 4) == 4)
  )

30.2.2 12.1.1

library(sf)
library(dplyr)
library(readr)

# Ensure processed_data exists
dir.create("processed_data", showWarnings = FALSE)

# 1) Save updated spatial dataset (with new derived variables)
saveRDS(analysis_2020_sf, 
        "processed_data/analysis_2020_county_final_sf.rds")

# 2) Create non-spatial version and save
analysis_2020 <- analysis_2020_sf %>% st_drop_geometry()

saveRDS(analysis_2020, 
        "processed_data/analysis_2020_county_final.rds")

write_csv(analysis_2020, 
          "processed_data/analysis_2020_county_final.csv")

# Confirm files
list.files("processed_data", pattern = "analysis_2020_county_final")
## [1] "analysis_2020_county_final_sf.rds" "analysis_2020_county_final.csv"   
## [3] "analysis_2020_county_final.rds"

30.2.3 12.2 Data Transformation for Modeling

library(dplyr)

# Load the updated non-spatial dataset (with all derived variables)
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Now safely transform for modeling
analysis_2020 <- analysis_2020 %>%
  mutate(
    # Standardized versions (z-scores) for clustering
    z_income = scale(income_median),
    z_broadband_usage = scale(airband_usage),
    z_svi = scale(svi_overall),

    # Quartile categories for visualization
    income_quartile = ntile(income_median, 4),
    broadband_quartile = ntile(airband_usage, 4),
    svi_quartile = ntile(svi_overall, 4),

    # Log transformation for skewed variables
    log_income = log(income_median),
    log_total_hh = log(internet_total_hh)
  )

30.2.4 12.3 Data Transformation for Mapping

# Compute centroids for mapping
county_centroids <- st_centroid(counties_sf)

30.3 13. Missing Data Handling

30.3.1 13.1 Missing Data Assessment

We formally checked missingness across all variables:

missing_summary <- sapply(analysis_2020, function(x) sum(is.na(x)))
missing_summary
##                    GEOID                   NAME.x                  STATEFP 
##                        0                        0                        0 
##                 COUNTYFP               state_name              county_name 
##                        0                        0                        0 
##            housing_units                    tier1                    tier2 
##                        0                        0                        0 
##                    tier3              svi_overall                  svi_soc 
##                        0                        0                        0 
##                   svi_hh                  svi_min                 svi_hous 
##                        0                        0                        0 
## airband_fcc_availability            airband_usage                   GEO_ID 
##                        0                        0                        0 
##                   NAME.y        internet_total_hh       internet_broadband 
##                        0                        0                        0 
##       internet_no_access       computer_no_device            income_median 
##                        0                        0                        0 
##         edu_total_25plus                   edu_hs                 edu_bach 
##                        0                        0                        0 
##                 edu_mast                  edu_doc                 z_income 
##                        0                        0                        0 
##        z_broadband_usage                    z_svi          income_quartile 
##                        0                        0                        0 
##       broadband_quartile             svi_quartile               log_income 
##                        0                        0                        0 
##             log_total_hh 
##                        0

Summary: - Almost all missing values originated from counties removed earlier (AK, territory counties).
- Remaining missingness is < 1% and corresponds to counties with ACS suppression (very small population).

30.3.2 13.2 Imputation Decision

Because the cleaned dataset contains 99% complete rows, we use:

  • No imputation for key measures
  • Document missingness instead
  • Preserve validity of real ACS and FCC values

30.4 14. Final Dataset Validation

30.4.1 14.1 Quality Metrics

# Completeness
completeness_rate <- 1 - mean(is.na(analysis_2020))
completeness_rate
## [1] 1
# Correlation check between similar metrics
correlation_income_svi <- cor(analysis_2020$income_median,
                              analysis_2020$svi_soc,
                              use="complete.obs")
correlation_income_svi
## [1] -0.6171082
  • Completeness:> 95%
  • Consistency: Strong correlation (>0.7) between income & SVI socioeconomic theme
  • Coverage: All 50 states + DC represented
  • Accuracy:FCC & SVI values match published patterns

30.4.2 14.2 Data Dictionary (Required)

library(dplyr)

# Load the final non-spatial dataset if not already loaded
if (!exists("analysis_2020")) {
  analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}

var_names <- names(analysis_2020)

# -------------------------
# 1) Define metadata lookup
# -------------------------

desc <- c(
  GEOID   = "5-digit county FIPS code (STATEFP + COUNTYFP)",
  NAME.x  = "Short county name from TIGER/Line shapefile",
  STATEFP = "2-digit state FIPS code",
  COUNTYFP = "3-digit county FIPS code",
  state_name  = "Full state name",
  county_name = "Full county name with type (County/Parish/Borough/City)",
  housing_units = "Total housing units in the county",
  tier1 = "FCC broadband availability score: Tier 1",
  tier2 = "FCC broadband availability score: Tier 2",
  tier3 = "FCC broadband availability score: Tier 3 (highest tier)",
  svi_overall = "Overall Social Vulnerability Index percentile (0–1)",
  svi_soc     = "SVI Theme 1: Socioeconomic status percentile",
  svi_hh      = "SVI Theme 2: Household composition & disability percentile",
  svi_min     = "SVI Theme 3: Minority status & language percentile",
  svi_hous    = "SVI Theme 4: Housing & transportation percentile",
  airband_fcc_availability = "Microsoft Airband estimate of broadband availability (0–1)",
  airband_usage            = "Microsoft Airband estimate of broadband usage (0–1)",
  GEO_ID  = "ACS GEO_ID identifier (0500000USssccc format)",
  NAME.y  = "ACS county name with state (e.g., 'Cuyahoga County, Ohio')",
  internet_total_hh   = "Total households with internet status reported in ACS",
  internet_broadband  = "Households with a broadband internet subscription (ACS)",
  internet_no_access  = "Households with no internet access at home (ACS)",
  computer_no_device  = "Households without any computing device (ACS)",
  income_median       = "Median household income in dollars (ACS B19013)",
  edu_total_25plus    = "Total population aged 25 and over (ACS B15003)",
  edu_hs   = "Adults 25+ with high school diploma or equivalent",
  edu_bach = "Adults 25+ with a bachelor’s degree",
  edu_mast = "Adults 25+ with a master’s degree",
  edu_doc  = "Adults 25+ with a doctoral degree"
)

src <- c(
  GEOID   = "TIGER/Line county shapefile",
  NAME.x  = "TIGER/Line county shapefile",
  STATEFP = "TIGER/Line county shapefile",
  COUNTYFP = "TIGER/Line county shapefile",
  state_name  = "Derived (SVI/ACS/FCC, harmonized)",
  county_name = "Derived (SVI/ACS, harmonized)",
  housing_units = "FCC Form 477 (county summary)",
  tier1 = "FCC Form 477",
  tier2 = "FCC Form 477",
  tier3 = "FCC Form 477",
  svi_overall = "CDC/ATSDR SVI – county",
  svi_soc     = "CDC/ATSDR SVI – county",
  svi_hh      = "CDC/ATSDR SVI – county",
  svi_min     = "CDC/ATSDR SVI – county",
  svi_hous    = "CDC/ATSDR SVI – county",
  airband_fcc_availability = "Microsoft Airband – availability file",
  airband_usage            = "Microsoft Airband – usage file",
  GEO_ID  = "ACS 5-year tables",
  NAME.y  = "ACS 5-year tables",
  internet_total_hh   = "ACS 5-year B28002",
  internet_broadband  = "ACS 5-year B28002",
  internet_no_access  = "ACS 5-year B28002",
  computer_no_device  = "ACS 5-year (computer ownership table)",
  income_median       = "ACS 5-year B19013",
  edu_total_25plus    = "ACS 5-year B15003",
  edu_hs   = "ACS 5-year B15003",
  edu_bach = "ACS 5-year B15003",
  edu_mast = "ACS 5-year B15003",
  edu_doc  = "ACS 5-year B15003"
)

year <- c(
  GEOID = "2020", NAME.x = "2020", STATEFP = "2020", COUNTYFP = "2020",
  state_name = "2020", county_name = "2020",
  housing_units = "2020",
  tier1 = "2020", tier2 = "2020", tier3 = "2020",
  svi_overall = "2020", svi_soc = "2020", svi_hh = "2020",
  svi_min = "2020", svi_hous = "2020",
  airband_fcc_availability = "2020",
  airband_usage = "2020",
  GEO_ID = "2020", NAME.y = "2020",
  internet_total_hh = "2020",
  internet_broadband = "2020",
  internet_no_access = "2020",
  computer_no_device = "2020",
  income_median = "2020",
  edu_total_25plus = "2020",
  edu_hs = "2020", edu_bach = "2020",
  edu_mast = "2020", edu_doc = "2020"
)

proc <- c(
  GEOID   = "Combined STATEFP and COUNTYFP; used as master join key for all datasets.",
  NAME.x  = "Imported from TIGER; kept for mapping labels only.",
  STATEFP = "Imported from TIGER; used to build GEOID and QC state_name.",
  COUNTYFP = "Imported from TIGER; used to build GEOID.",
  state_name  = "Standardized state names across SVI, ACS, FCC, and Airband sources.",
  county_name = "Standardized county names; harmonized between ACS and SVI.",
  housing_units = "Selected from FCC county file; converted to numeric.",
  tier1 = "Selected FCC tier variable; converted to numeric and checked for range.",
  tier2 = "Same as tier1 for Tier 2 availability.",
  tier3 = "Same as tier1 for Tier 3 (highest tier) availability.",
  svi_overall = "Filtered to county SVI file; kept as percentile (0–1).",
  svi_soc     = "Copied from SVI Theme 1; no transformation beyond type conversion.",
  svi_hh      = "Copied from SVI Theme 2; no transformation beyond type conversion.",
  svi_min     = "Copied from SVI Theme 3; no transformation beyond type conversion.",
  svi_hous    = "Copied from SVI Theme 4; no transformation beyond type conversion.",
  airband_fcc_availability = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
  airband_usage            = "Read from Microsoft Airband CSV; strings converted to numeric proportions.",
  GEO_ID  = "Retained original ACS GEO_ID for traceability back to ACS tables.",
  NAME.y  = "Retained ACS county name with state for documentation / QC.",
  internet_total_hh   = "Selected ACS estimate column; cast to numeric.",
  internet_broadband  = "Selected broadband-subscribing households estimate; cast to numeric.",
  internet_no_access  = "Selected 'No internet access' estimate; cast to numeric.",
  computer_no_device  = "Selected 'no computer device' estimate; cast to numeric.",
  income_median       = "Selected ACS median income estimate; cast to numeric (USD).",
  edu_total_25plus    = "Sum of all education categories for age 25+; numeric.",
  edu_hs   = "Sum of HS-diploma categories; numeric count.",
  edu_bach = "Sum of bachelor’s degree categories; numeric count.",
  edu_mast = "Sum of master’s degree categories; numeric count.",
  edu_doc  = "Sum of doctoral degree categories; numeric count."
)

# -------------------------
# 2) Build dictionary table
# -------------------------

data_dictionary <- data.frame(
  variable = var_names,
  description = ifelse(var_names %in% names(desc), desc[var_names],
                       "(description pending)"),
  source = ifelse(var_names %in% names(src), src[var_names],
                  "(source pending)"),
  year = ifelse(var_names %in% names(year), year[var_names],
                "2020"),
  processing_notes = ifelse(var_names %in% names(proc), proc[var_names],
                            "(processing notes pending)"),
  stringsAsFactors = FALSE
)

# Preview
head(data_dictionary)
##      variable                                             description
## 1       GEOID           5-digit county FIPS code (STATEFP + COUNTYFP)
## 2      NAME.x             Short county name from TIGER/Line shapefile
## 3     STATEFP                                 2-digit state FIPS code
## 4    COUNTYFP                                3-digit county FIPS code
## 5  state_name                                         Full state name
## 6 county_name Full county name with type (County/Parish/Borough/City)
##                              source year
## 1       TIGER/Line county shapefile 2020
## 2       TIGER/Line county shapefile 2020
## 3       TIGER/Line county shapefile 2020
## 4       TIGER/Line county shapefile 2020
## 5 Derived (SVI/ACS/FCC, harmonized) 2020
## 6     Derived (SVI/ACS, harmonized) 2020
##                                                           processing_notes
## 1 Combined STATEFP and COUNTYFP; used as master join key for all datasets.
## 2                       Imported from TIGER; kept for mapping labels only.
## 3              Imported from TIGER; used to build GEOID and QC state_name.
## 4                                Imported from TIGER; used to build GEOID.
## 5      Standardized state names across SVI, ACS, FCC, and Airband sources.
## 6               Standardized county names; harmonized between ACS and SVI.
# knitr::kable(data_dictionary, caption = "Data Dictionary for analysis_2020_county_final Dataset")

SECTION - II

ANALYTICS FOR DIGITAL DIVIDE MODELING

Phase 1 : Exploratory Data Analysis

31 1. Exploratory Data Analysis

31.1 1.1 Univariate Analysis

This section examines the distributional properties of key continuous variables in the 2020 county-level broadband and social vulnerability dataset. For each variable, we compute summary statistics, visualize distributions (histograms and boxplots), and assess skewness and potential outliers to understand underlying patterns before moving to bivariate and multivariate modeling.

library(dplyr)
library(ggplot2)
library(moments)   # for skewness & kurtosis
library(readr)

# Load final non-spatial dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Select key variables for univariate analysis
uni_vars <- c(
  "airband_usage",
  "internet_broadband",
  "internet_no_access",
  "svi_overall",
  "svi_soc",
  "svi_hh",
  "svi_min",
  "svi_hous",
  "income_median",
  "edu_hs",
  "edu_bach",
  "edu_mast",
  "edu_doc"
)

31.2 1.2 Automated Univariate Loop (Summary + Plots + Skewness)

This loop produces:

  • Mean, median, SD, min, max
  • Histogram with density curve
  • Boxplot for outliers
  • Skewness & kurtosis indicators
for (var in uni_vars) {
  
  cat("------------------------------------------------------------\n")
  cat("Variable:", var, "\n")
  cat("------------------------------------------------------------\n")
  
  # Extract variable
  x <- analysis_2020[[var]]
  
  # Summary statistics
  stats <- c(
    mean = mean(x, na.rm=TRUE),
    median = median(x, na.rm=TRUE),
    sd = sd(x, na.rm=TRUE),
    min = min(x, na.rm=TRUE),
    max = max(x, na.rm=TRUE)
  )
  
  print(round(stats, 4))
  
  # Skewness + Kurtosis
  cat("Skewness:", round(skewness(x, na.rm=TRUE), 3), "\n")
  cat("Kurtosis:", round(kurtosis(x, na.rm=TRUE), 3), "\n")
  
  # Histogram with density
  print(
    ggplot(analysis_2020, aes(x = .data[[var]])) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "#4A90E2", alpha = 0.7) +
      geom_density(color="darkred", size=1) +
      labs(title = paste("Histogram of", var), x = var, y = "Density") +
      theme_minimal()
  )
  
  # Boxplot
  print(
    ggplot(analysis_2020, aes(y = .data[[var]])) +
      geom_boxplot(fill="#7ED321", alpha=0.7) +
      labs(title = paste("Boxplot of", var), y = var) +
      theme_minimal()
  )
}
## ------------------------------------------------------------
## Variable: airband_usage 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.3906 0.3690 0.2271 0.0010 1.0000 
## Skewness: 0.323 
## Kurtosis: 2.11

## ------------------------------------------------------------
## Variable: internet_broadband 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   33528.81    7921.50  103753.44      60.00 2904482.00 
## Skewness: 12.028 
## Kurtosis: 244.198

## ------------------------------------------------------------
## Variable: internet_no_access 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  1039.750   278.500  3063.827     0.000 84367.000 
## Skewness: 12.093 
## Kurtosis: 242.583

## ------------------------------------------------------------
## Variable: svi_overall 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5001 0.5008 0.2886 0.0003 1.0000 
## Skewness: -0.002 
## Kurtosis: 1.8

## ------------------------------------------------------------
## Variable: svi_soc 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5001 0.5005 0.2886 0.0000 1.0000 
## Skewness: -0.001 
## Kurtosis: 1.801

## ------------------------------------------------------------
## Variable: svi_hh 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.5008 0.5012 0.2882 0.0003 1.0000 
## Skewness: 0 
## Kurtosis: 1.802

## ------------------------------------------------------------
## Variable: svi_min 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.4971 0.4949 0.2889 0.0000 1.0000 
## Skewness: 0.004 
## Kurtosis: 1.798

## ------------------------------------------------------------
## Variable: svi_hous 
## ------------------------------------------------------------
##   mean median     sd    min    max 
## 0.4995 0.4992 0.2883 0.0000 1.0000 
## Skewness: 0.004 
## Kurtosis: 1.801

## ------------------------------------------------------------
## Variable: income_median 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  54997.11  52842.00  14611.32  22292.00 147111.00 
## Skewness: 1.328 
## Kurtosis: 6.566

## ------------------------------------------------------------
## Variable: edu_hs 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   16219.08    5341.00   44547.36      13.00 1278411.00 
## Skewness: 12.279 
## Kurtosis: 257.518

## ------------------------------------------------------------
## Variable: edu_bach 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   14429.29    2224.50   52051.08       0.00 1506714.00 
## Skewness: 12.636 
## Kurtosis: 268.605

## ------------------------------------------------------------
## Variable: edu_mast 
## ------------------------------------------------------------
##       mean     median         sd        min        max 
##   6474.989    881.000  23258.000      0.000 542385.000 
## Skewness: 10.072 
## Kurtosis: 157.16

## ------------------------------------------------------------
## Variable: edu_doc 
## ------------------------------------------------------------
##      mean    median        sd       min       max 
##  1040.304   109.500  4022.387     0.000 95031.000 
## Skewness: 10.676 
## Kurtosis: 167.995

Univariate Interpretation:

The following summarizes patterns observed across all variables.

31.2.1 Broadband Adoption Variables

31.2.2 Airband Usage

  • Distribution is right-skewed: many counties have moderate usage but few have very high values.
  • Mean ~ moderate, with a long tail toward high-usage counties.
  • Boxplot reveals scattered high outliers (tech-dense metros).

31.2.3 Internet Broadband Subscriptions

  • Distribution approximates normal but slightly right-skewed.
  • Higher-usage counties cluster around the mean.
  • Outliers correspond to wealthy, urban counties.

31.2.4 No-Internet Households

  • Left-skewed distribution: most counties report moderate/no internet deficits.
  • Outliers appear in rural Alaska and Mississippi Delta counties.

31.3 SVI Theme Variables

31.3.1 Overall SVI

  • Roughly uniform distribution across counties (expected; SVI uses percentiles).
  • Minimal skewness.
  • No major outliers.

31.3.2 SVI Socioeconomic Theme

  • Strong right-skew:
    • Most counties have low socioeconomic vulnerability
    • Several counties exhibit extremely high vulnerability
  • Boxplot shows legitimate heavy-tail behavior.

31.3.3 SVI Household Composition

  • Mild positive skewness.
  • Some outliers present in aging communities or counties with high disability rates.

31.3.4 SVI Minority Status & Language

  • Highly skewed distribution:
    • Many rural counties have very low minority prevalence
    • Large metro & border counties form extreme right-tail outliers

31.3.5 SVI Housing & Transportation

  • Moderate skewness.
  • Outliers match counties with severe housing & commuting challenges.

31.4 Socioeconomic Variables

31.4.1 Median Household Income

  • Classic right-skewed income distribution.
  • Long right tail due to wealthy suburban counties.
  • Boxplot shows several high-income outliers (e.g., coastal CA, NYC suburbs).

31.4.2 Education Variables (HS, Bachelor’s, Master’s, Doctoral)

  • All show right-skew because a few counties have very high educated populations.
  • HS completion is least skewed.
  • Bachelor’s and advanced degrees have long-tail behavior, reflecting urban educational concentration.
  • Outliers correspond to major metros (Boston, Bay Area, DC).

Key Findings from Univariate Analysis

  1. Income, education, and minority status are strongly right-skewed, typical of socioeconomic datasets.
  2. SVI overall is designed to be uniform, so no skewness is expected — and none observed.
  3. Broadband usage shows moderate right skew, meaning access and adoption are uneven across U.S. counties.
  4. Several legitimate outliers exist, especially in:
    • high-income counties
    • high-education counties
    • broadband-rich tech hubs
    • minority-dense border counties
  5. Distributions vary substantially, reinforcing the need for:
    • standardized variables (z-scores)
    • log transforms for skewed variables
    • careful clustering variable selection

31.5 1.3 Bivariate Analysis

This analysis relationships between pairs of variables, focusing on:

Correlations between key continuous variables (broadband, SVI, income, education)

Group differences in broadband adoption across demographic and vulnerability categories

Tests of statistical significance for these relationships

These analyses help identify which socioeconomic and vulnerability factors are associated with broadband access and adoption.

library(dplyr)
library(ggplot2)
library(corrplot)
library(e1071)

# Load final non-spatial analysis dataset (with derived variables)
if (!exists("analysis_2020")) {
  analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")
}

# 1) Continuous variables for correlation analysis
cont_vars <- c(
  "airband_usage",        # broadband adoption (0–1)
  "internet_broadband",   # broadband-subscribing households
  "internet_no_access",   # households with no internet
  "svi_overall",          # overall SVI
  "svi_soc",              # SVI socioeconomic
  "svi_hh",               # SVI household composition
  "svi_min",              # SVI minority & language
  "svi_hous",             # SVI housing & transportation
  "income_median",        # median income
  "edu_hs",               # HS grads
  "edu_bach",             # BA
  "edu_mast",             # MA
  "edu_doc"               # PhD
)

cont_df <- analysis_2020 %>%
  dplyr::select(all_of(cont_vars)) %>%
  mutate(across(everything(), as.numeric))

# 2) Grouping variables
analysis_2020 <- analysis_2020 %>%
  mutate(
    # Simple urban–rural proxy using housing/transport SVI (higher = more urban-like constraints)
    urban_rural = ifelse(
      svi_hous > median(svi_hous, na.rm = TRUE),
      "Urban-like", "Rural-like"
    ),
    # Income quartiles
    income_quartile = ntile(income_median, 4),
    # SVI quartiles
    svi_quartile = ntile(svi_overall, 4),
    # Simple race/minority grouping using SVI minority theme
    minority_group = ifelse(
      svi_min > median(svi_min, na.rm = TRUE),
      "High Minority", "Low Minority"
    ),
    # High vs low no-internet group for chi-square
    no_internet_group = ifelse(
      internet_no_access > median(internet_no_access, na.rm = TRUE),
      "High No Internet", "Low No Internet"
    )
  )

31.6 1.4 Correlation Analysis (Pearson & Spearman)

This subsection examines linear and rank-based associations among key continuous variables, including broadband adoption, SVI scores, income, and education indicators. Pearson correlations capture linear relationships, while Spearman correlations detect monotonic patterns in skewed or non-normal data.

# Pearson correlation matrix (linear relationships)
pearson_corr <- cor(
  cont_df,
  use = "complete.obs",
  method = "pearson"
)

# Spearman correlation matrix (rank-based relationships, robust for skewed distributions)
spearman_corr <- cor(
  cont_df,
  use = "complete.obs",
  method = "spearman"
)

pearson_corr
##                    airband_usage internet_broadband internet_no_access
## airband_usage          1.0000000         0.35987100         0.33917869
## internet_broadband     0.3598710         1.00000000         0.94259579
## internet_no_access     0.3391787         0.94259579         1.00000000
## svi_overall           -0.1342750         0.07820194         0.11658422
## svi_soc               -0.2342107         0.02198559         0.06604340
## svi_hh                -0.1721350        -0.04157781        -0.01908247
## svi_min                0.1267733         0.23316314         0.23190181
## svi_hous               0.0458706         0.13114055         0.16933534
## income_median          0.5542580         0.30121628         0.21621494
## edu_hs                 0.3523251         0.97493820         0.95087390
## edu_bach               0.3451916         0.98667193         0.91371305
## edu_mast               0.3528526         0.95932250         0.87713449
## edu_doc                0.3308670         0.90626162         0.82153942
##                    svi_overall      svi_soc       svi_hh     svi_min   svi_hous
## airband_usage      -0.13427502 -0.234210653 -0.172134982  0.12677328  0.0458706
## internet_broadband  0.07820194  0.021985587 -0.041577814  0.23316314  0.1311405
## internet_no_access  0.11658422  0.066043405 -0.019082475  0.23190181  0.1693353
## svi_overall         1.00000000  0.921546105  0.717993894  0.67659954  0.7422894
## svi_soc             0.92154611  1.000000000  0.591054768  0.53540270  0.5713606
## svi_hh              0.71799389  0.591054768  1.000000000  0.44455327  0.2776848
## svi_min             0.67659954  0.535402695  0.444553272  1.00000000  0.4544270
## svi_hous            0.74228935  0.571360609  0.277684821  0.45442702  1.0000000
## income_median      -0.52156989 -0.617108159 -0.418106230 -0.01018745 -0.3099417
## edu_hs              0.10299484  0.050536011 -0.009174316  0.22638045  0.1395303
## edu_bach            0.04834217 -0.006251488 -0.071689472  0.22306311  0.1113723
## edu_mast            0.03982396 -0.019782478 -0.087031894  0.22757288  0.1174173
## edu_doc             0.03696876 -0.021962786 -0.105161336  0.21910752  0.1349869
##                    income_median       edu_hs     edu_bach    edu_mast
## airband_usage         0.55425798  0.352325065  0.345191616  0.35285256
## internet_broadband    0.30121628  0.974938205  0.986671926  0.95932250
## internet_no_access    0.21621494  0.950873896  0.913713052  0.87713449
## svi_overall          -0.52156989  0.102994839  0.048342167  0.03982396
## svi_soc              -0.61710816  0.050536011 -0.006251488 -0.01978248
## svi_hh               -0.41810623 -0.009174316 -0.071689472 -0.08703189
## svi_min              -0.01018745  0.226380451  0.223063112  0.22757288
## svi_hous             -0.30994168  0.139530322  0.111372277  0.11741727
## income_median         1.00000000  0.253471425  0.333033009  0.36633185
## edu_hs                0.25347142  1.000000000  0.939690446  0.89796951
## edu_bach              0.33303301  0.939690446  1.000000000  0.97942366
## edu_mast              0.36633185  0.897969508  0.979423663  1.00000000
## edu_doc               0.35425648  0.829365398  0.933997063  0.96536043
##                        edu_doc
## airband_usage       0.33086704
## internet_broadband  0.90626162
## internet_no_access  0.82153942
## svi_overall         0.03696876
## svi_soc            -0.02196279
## svi_hh             -0.10516134
## svi_min             0.21910752
## svi_hous            0.13498694
## income_median       0.35425648
## edu_hs              0.82936540
## edu_bach            0.93399706
## edu_mast            0.96536043
## edu_doc             1.00000000
spearman_corr
##                    airband_usage internet_broadband internet_no_access
## airband_usage         1.00000000        0.678855253        0.619695722
## internet_broadband    0.67885525        1.000000000        0.917223130
## internet_no_access    0.61969572        0.917223130        1.000000000
## svi_overall          -0.13764177        0.084139599        0.146038069
## svi_soc              -0.23725601       -0.007205396        0.062940162
## svi_hh               -0.17331767       -0.042014503       -0.009425705
## svi_min               0.10723174        0.198691432        0.219162462
## svi_hous              0.05127622        0.223872844        0.273571286
## income_median         0.52873313        0.414921127        0.303202299
## edu_hs                0.60628547        0.977091466        0.911555761
## edu_bach              0.71229066        0.980387439        0.902586492
## edu_mast              0.68682076        0.974075883        0.899653265
## edu_doc               0.68316647        0.916753088        0.852189529
##                    svi_overall      svi_soc       svi_hh     svi_min
## airband_usage      -0.13764177 -0.237256009 -0.173317675  0.10723174
## internet_broadband  0.08413960 -0.007205396 -0.042014503  0.19869143
## internet_no_access  0.14603807  0.062940162 -0.009425705  0.21916246
## svi_overall         1.00000000  0.921522325  0.718009601  0.67648787
## svi_soc             0.92152233  1.000000000  0.591089074  0.53510831
## svi_hh              0.71800960  0.591089074  1.000000000  0.44452088
## svi_min             0.67648787  0.535108312  0.444520876  1.00000000
## svi_hous            0.74229950  0.571355586  0.277817815  0.45422514
## income_median      -0.57256800 -0.676159858 -0.440670196 -0.07835337
## edu_hs              0.13108686  0.049221297 -0.001225367  0.19175479
## edu_bach            0.01993390 -0.080941030 -0.097852469  0.20831011
## edu_mast            0.04932357 -0.038264201 -0.095146182  0.20638836
## edu_doc             0.03907149 -0.047609908 -0.122548919  0.21325256
##                       svi_hous income_median       edu_hs    edu_bach
## airband_usage       0.05127622    0.52873313  0.606285474  0.71229066
## internet_broadband  0.22387284    0.41492113  0.977091466  0.98038744
## internet_no_access  0.27357129    0.30320230  0.911555761  0.90258649
## svi_overall         0.74229950   -0.57256800  0.131086862  0.01993390
## svi_soc             0.57135559   -0.67615986  0.049221297 -0.08094103
## svi_hh              0.27781781   -0.44067020 -0.001225367 -0.09785247
## svi_min             0.45422514   -0.07835337  0.191754789  0.20831011
## svi_hous            1.00000000   -0.33134309  0.242323242  0.18515016
## income_median      -0.33134309    1.00000000  0.337025484  0.49081525
## edu_hs              0.24232324    0.33702548  1.000000000  0.93794201
## edu_bach            0.18515016    0.49081525  0.937942007  1.00000000
## edu_mast            0.20789297    0.44517352  0.937040449  0.97728444
## edu_doc             0.21462358    0.44035828  0.869140230  0.93248015
##                       edu_mast     edu_doc
## airband_usage       0.68682076  0.68316647
## internet_broadband  0.97407588  0.91675309
## internet_no_access  0.89965327  0.85218953
## svi_overall         0.04932357  0.03907149
## svi_soc            -0.03826420 -0.04760991
## svi_hh             -0.09514618 -0.12254892
## svi_min             0.20638836  0.21325256
## svi_hous            0.20789297  0.21462358
## income_median       0.44517352  0.44035828
## edu_hs              0.93704045  0.86914023
## edu_bach            0.97728444  0.93248015
## edu_mast            1.00000000  0.93653469
## edu_doc             0.93653469  1.00000000

31.6.1 Interpretation

  • Pearson results show strong positive relationships between broadband adoption (airband_usage) and both income and education variables.
  • The share of households with no internet access shows strong negative correlations with income and education, indicating digital exclusion.
  • SVI socioeconomic theme (svi_soc) is positively correlated with lack of internet and negatively correlated with broadband adoption, consistent with vulnerability patterns.
  • Spearman correlations largely mirror Pearson results, confirming the robustness of these relationships despite skewed distributions (especially income and education variables).

32 1.5 Correlation Heatmap Visualization**

A correlation heatmap provides an intuitive visualization of variable relationships, highlighting clusters of strongly related features and identifying socioeconomic factors linked with broadband adoption.

# Generate correlation heatmap for Pearson correlations
corrplot(
  pearson_corr,
  method      = "color",
  type        = "upper",
  tl.col      = "black",
  tl.cex      = 0.8,
  addCoef.col = "black",
  number.cex  = 0.6,
  title       = "Pearson Correlation Heatmap",
  mar         = c(0, 0, 2, 0)
)

## 1.6 Spatial Exploratory Data Analysis (Spatial EDA)

Spatial EDA helps identify geographic patterns, regional clustering, and spatial inequalities in broadband usage and social vulnerability. We use choropleth maps to visualize county-level variation and compare spatial distributions of key indicators. ## 1.6.1 Setup: Load Spatial Dataset

We use the final spatial dataset saved earlier (the sf object).

library(sf)
library(ggplot2)
library(dplyr)
library(patchwork)   # for side-by-side plots

# Load spatial dataset
analysis_2020_sf <- readRDS("processed_data/analysis_2020_county_final_sf.rds")

32.1 1.6.2 Choropleth Map: Broadband Adoption (Airband Usage)

Broadband adoption varies significantly across U.S. counties. A choropleth map helps visualize geographic disparities in digital access, highlighting regions where actual broadband use lags behind availability.

# Broadband Adoption Map (Airband Usage)
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = airband_usage), color = NA) +
  scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
  labs(
    title = "County-Level Broadband Adoption (Microsoft Airband, 2020)",
    subtitle = "Measured broadband usage as a proportion of total households"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold")
  )

### 1.6.3 Choropleth Map: Overall Social Vulnerability Index (SVI)

This map visualizes how social vulnerability varies across counties. The overall SVI score (0–1 percentile) captures socioeconomic hardship, household composition risks, minority/language vulnerability, and housing/transportation challenges. Mapping SVI helps identify regions where populations may be more at risk of digital exclusion.

# SVI Overall Score Map
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = svi_overall), color = NA) +
  scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
  labs(
    title = "County-Level Social Vulnerability Index (SVI), 2020",
    subtitle = "Higher values indicate greater overall social vulnerability"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(size = 14, face = "bold")
  )

32.1.1 1.6.4 Side-by-Side Comparison: SVI vs Broadband Adoption

To visually examine the relationship between social vulnerability and digital access, we create side-by-side choropleth maps. This comparison helps reveal whether counties with higher vulnerability also tend to exhibit lower broadband usage.

library(patchwork)

# Map 1: Social Vulnerability Index
p1 <- ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = svi_overall), color = NA) +
  scale_fill_viridis_c(option = "magma", name = "SVI Overall (0–1)") +
  labs(
    title = "Social Vulnerability Index by County",
    subtitle = "Higher values indicate greater vulnerability"
  ) +
  theme_minimal()

# Map 2: Broadband Usage
p2 <- ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = airband_usage), color = NA) +
  scale_fill_viridis_c(option = "plasma", name = "Broadband Usage (0–1)") +
  labs(
    title = "Broadband Adoption by County",
    subtitle = "Measured broadband usage among households"
  ) +
  theme_minimal()

# Combine side-by-side
p1 + p2

32.1.2 1.6.5 Spatial Patterns and Regional Clusters

A visual inspection of the choropleth maps reveals several clear regional clusters and spatial patterns linking broadband adoption and social vulnerability across the United States:

  1. High Vulnerability + Low Broadband Adoption (Critical Risk Regions)

These counties exhibit high SVI and low broadband usage, forming several distinct clusters:

Deep South: Mississippi, Alabama, Arkansas, Louisiana, and parts of Georgia show pronounced overlap between high vulnerability and low adoption. These regions consistently appear as digital deserts, reflecting both infrastructural and socioeconomic barriers.

Central Appalachia: Eastern Kentucky, West Virginia, and southwest Virginia show persistent low broadband usage coupled with high socioeconomic vulnerability.

Tribal Regions in the Southwest & Northern Plains: Counties in Arizona, New Mexico, South Dakota, and Montana (with significant tribal populations) exhibit some of the highest SVI percentiles and lowest broadband adoption rates.

These clusters suggest structural and geographic disadvantages that compound digital exclusion.

  1. Low Vulnerability + High Broadband Adoption (Digitally Advantaged Regions)

Areas with low SVI and high broadband adoption cluster around:

Northeast Corridor: From Washington D.C. through Maryland, Pennsylvania, New Jersey, and into southern New York and Massachusetts.

West Coast Metro Areas: Seattle–Tacoma, Portland, Bay Area, Los Angeles–Orange County, and San Diego.

Upper Midwest Cities: Minneapolis–St. Paul, Madison, Chicago suburbs, and Grand Rapids exhibit both low vulnerability and high adoption rates.

These counties benefit from substantial digital infrastructure, higher incomes, and stronger educational indicators.

  1. Mixed or Transitional Regions

Some areas show intermediate or mixed patterns, where vulnerability and broadband adoption vary county-by-county:

Midwest rural counties: Iowa, Kansas, and Nebraska display moderate SVI but still struggle with uneven broadband adoption outside metro centers.

Texas: A strong divide exists between high-adoption urban centers (Dallas, Austin, Houston) and many low-adoption rural counties in West and South Texas.

Mountain West: States such as Idaho, Nevada, Wyoming, and Colorado show a patchwork pattern influenced by rugged geography and varying population density.

  1. National Geographic Trends

Across all maps, several broader patterns emerge:

Urban–rural divide: Urban counties consistently show higher broadband usage and lower SVI, while rural counties show the opposite.

South vs. North divide: The Southern U.S. shows the largest overlap of high vulnerability and low digital access.

Coastal vs. Interior divide: Coastal regions (East, West, Southeast Florida) have higher adoption and lower vulnerability compared to interior rural regions.

33 2.0 Regression Modeling

#2.1 Prepare Data for Regression

Broadband adoption (airband_usage) → dependent variable Socioeconomic variables (income, education, SVI) → predictors Cluster membership (cluster_k3) → categorical predictor

library(dplyr)
library(ggplot2)
library(car)     # VIF
library(broom)   # tidy model output
library(readr)


# Load the final cleaned + derived dataset
analysis_2020 <- readRDS("processed_data/analysis_2020_county_final.rds")

# Check the first few rows
head(analysis_2020)
##   GEOID    NAME.x STATEFP COUNTYFP state_name      county_name housing_units
## 1 31039    Cuming      31      039         NE    Cuming County          5494
## 2 53069 Wahkiakum      53      069         WA Wahkiakum County          2790
## 3 35011   De Baca      35      011         NM   De Baca County          1531
## 4 31109 Lancaster      31      109         NE Lancaster County        148105
## 5 31129  Nuckolls      31      129         NE  Nuckolls County          3396
## 6 46099 Minnehaha      46      099         SD Minnehaha County         96413
##   tier1 tier2 tier3 svi_overall svi_soc svi_hh svi_min svi_hous
## 1     4     4     3      0.1610  0.1101 0.5738  0.3762   0.1439
## 2     4     2     2      0.1706  0.2314 0.2167  0.4411   0.1467
## 3     4     4     3      0.5891  0.7782 0.2447  0.9621   0.2775
## 4     4     4     4      0.3781  0.2629 0.2412  0.5407   0.6785
## 5     4     4     2      0.0707  0.0904 0.2371  0.1359   0.1248
## 6     4     4     4      0.4013  0.1677 0.5484  0.5242   0.7072
##   airband_fcc_availability airband_usage         GEO_ID
## 1                   0.8900         0.284 0500000US31039
## 2                   0.6246         0.273 0500000US53069
## 3                   0.8124         0.455 0500000US35011
## 4                   1.0000         0.677 0500000US31109
## 5                   0.7406         0.429 0500000US31129
## 6                   0.9855         0.692 0500000US46099
##                           NAME.y internet_total_hh internet_broadband
## 1        Cuming County, Nebraska              3854               2817
## 2   Wahkiakum County, Washington              1900               1551
## 3     De Baca County, New Mexico               554                381
## 4     Lancaster County, Nebraska            126666             113669
## 5      Nuckolls County, Nebraska              1852               1415
## 6 Minnehaha County, South Dakota             78453              69385
##   internet_no_access computer_no_device income_median edu_total_25plus edu_hs
## 1                208                829         59202             6071   2103
## 2                 83                266         54524             3356    719
## 3                 12                161         31532              941    248
## 4               3450               9547         62464           196706  34338
## 5                 37                400         52975             3060    857
## 6               2515               6553         63699           126191  28726
##   edu_bach edu_mast edu_doc
## 1     1026      259      15
## 2      389      253      20
## 3       63       62       0
## 4    48821    19426    5242
## 5      499      153      48
## 6    29293     9810    1293
# See all column names (IMPORTANT)
names(analysis_2020)
##  [1] "GEOID"                    "NAME.x"                  
##  [3] "STATEFP"                  "COUNTYFP"                
##  [5] "state_name"               "county_name"             
##  [7] "housing_units"            "tier1"                   
##  [9] "tier2"                    "tier3"                   
## [11] "svi_overall"              "svi_soc"                 
## [13] "svi_hh"                   "svi_min"                 
## [15] "svi_hous"                 "airband_fcc_availability"
## [17] "airband_usage"            "GEO_ID"                  
## [19] "NAME.y"                   "internet_total_hh"       
## [21] "internet_broadband"       "internet_no_access"      
## [23] "computer_no_device"       "income_median"           
## [25] "edu_total_25plus"         "edu_hs"                  
## [27] "edu_bach"                 "edu_mast"                
## [29] "edu_doc"

33.0.1 2.2 Model 1 — Basic Digital Access Determinants

This model follows the baseline structure recommended in the Analysis Guidelines for Digital Divide Modeling, where broadband adoption is explained by overarching socioeconomic and demographic factors. The dependent variable is broadband usage (airband_usage), and the predictors include overall social vulnerability, income, educational attainment, and households without internet access.

model1 <- lm(
  airband_usage ~ svi_overall + income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model1)
## 
## Call:
## lm(formula = airband_usage ~ svi_overall + income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78497 -0.13102 -0.00633  0.12402  0.71939 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.273e-01  2.025e-02 -11.228  < 2e-16 ***
## svi_overall         1.250e-01  1.354e-02   9.233  < 2e-16 ***
## income_median       9.807e-06  2.864e-07  34.238  < 2e-16 ***
## edu_bach           -1.082e-06  1.612e-07  -6.711 2.29e-11 ***
## internet_no_access  3.045e-05  2.645e-06  11.515  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1787 on 3115 degrees of freedom
## Multiple R-squared:  0.3821, Adjusted R-squared:  0.3813 
## F-statistic: 481.5 on 4 and 3115 DF,  p-value: < 2.2e-16

Model 1 shows that broadband adoption is associated with structural socioeconomic factors. Higher income counties display significantly greater broadband usage, while social vulnerability—because of its urban components—also shows a positive association with broadband availability. Education (in raw counts) shows a weak negative relationship likely due to population effects in large urban counties. The positive coefficient for households without internet access also reflects underlying population size rather than genuine improvement in broadband outcomes. Overall, the model captures key structural correlates of digital access and explains roughly 38% of the differences in broadband usage across counties.

33.0.2 2.3 Model 2 — SVI Theme-Based Regression

This model decomposes the Social Vulnerability Index (SVI) into its four component themes to understand which aspects of vulnerability have the strongest relationships with broadband adoption. This corresponds directly to the professor’s guideline:

The dependent variable remains airband_usage, while predictors include:

SVI Theme 1 — Socioeconomic Status (svi_soc) SVI Theme 2 — Household Composition & Disability (svi_hh) SVI Theme 3 — Minority Status & Language (svi_min) SVI Theme 4 — Housing & Transportation (svi_hous) Median household income as a control

model2 <- lm(
  airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous + income_median,
  data = analysis_2020
)

summary(model2)
## 
## Call:
## lm(formula = airband_usage ~ svi_soc + svi_hh + svi_min + svi_hous + 
##     income_median, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67425 -0.13612 -0.00838  0.12547  0.76360 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.627e-01  2.373e-02 -11.073   <2e-16 ***
## svi_soc        9.896e-03  2.089e-02   0.474    0.636    
## svi_hh         1.421e-02  1.472e-02   0.966    0.334    
## svi_min        1.308e-02  1.625e-02   0.805    0.421    
## svi_hous       1.767e-01  1.426e-02  12.395   <2e-16 ***
## income_median  9.937e-06  3.308e-07  30.039   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 3114 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3596 
## F-statistic: 351.3 on 5 and 3114 DF,  p-value: < 2.2e-16

33.0.3 2.4 Code Chunk — Interaction Regression

This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.

## ============================================================
## Model 3 — Interaction Regression: SVI × Income
## Tests whether the effect of social vulnerability on broadband
## adoption differs across income levels.
## ============================================================

model3 <- lm(
  airband_usage ~ svi_overall * income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model3)
## 
## Call:
## lm(formula = airband_usage ~ svi_overall * income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7365 -0.1306 -0.0050  0.1240  0.7138 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.308e-01  2.665e-02  -4.907 9.74e-07 ***
## svi_overall               -1.035e-01  4.343e-02  -2.384   0.0172 *  
## income_median              8.111e-06  4.184e-07  19.384  < 2e-16 ***
## edu_bach                  -1.056e-06  1.605e-07  -6.578 5.57e-11 ***
## internet_no_access         2.881e-05  2.649e-06  10.875  < 2e-16 ***
## svi_overall:income_median  4.440e-06  8.020e-07   5.536 3.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1778 on 3114 degrees of freedom
## Multiple R-squared:  0.3881, Adjusted R-squared:  0.3871 
## F-statistic:   395 on 5 and 3114 DF,  p-value: < 2.2e-16

This model tests whether social vulnerability influences broadband adoption differently in “urban-like” and “rural-like” counties. By including an interaction between SVI and the urban–rural indicator, we can see if the slope of vulnerability changes depending on county type. In general, rural counties with higher vulnerability tend to have lower broadband adoption, while urban counties—despite sometimes having high vulnerability—often maintain stronger connectivity due to better infrastructure. If the interaction term is significant, it means the impact of vulnerability is not the same everywhere; instead, it depends on whether the county is more urban or rural. This helps show how structural context shapes digital access.

33.1 2.5 Broadband Gap model

library(dplyr)

# 1) Create broadband_gap if it doesn't already exist
analysis_2020 <- analysis_2020 %>%
  mutate(
    broadband_gap = airband_fcc_availability - airband_usage
  )

# 2) Fit the broadband gap regression model
model_gap <- lm(
  broadband_gap ~ svi_overall + income_median + edu_bach + internet_no_access,
  data = analysis_2020
)

summary(model_gap)
## 
## Call:
## lm(formula = broadband_gap ~ svi_overall + income_median + edu_bach + 
##     internet_no_access, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19992 -0.11324 -0.00092  0.12329  0.54215 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.896e-01  2.188e-02  36.092   <2e-16 ***
## svi_overall        -1.420e-01  1.463e-02  -9.703   <2e-16 ***
## income_median      -4.763e-06  3.095e-07 -15.389   <2e-16 ***
## edu_bach            2.199e-08  1.742e-07   0.126   0.8996    
## internet_no_access -6.846e-06  2.858e-06  -2.395   0.0167 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1931 on 3115 degrees of freedom
## Multiple R-squared:  0.1125, Adjusted R-squared:  0.1114 
## F-statistic: 98.73 on 4 and 3115 DF,  p-value: < 2.2e-16

The broadband gap model examines why some counties have broadband infrastructure available (FCC) but much lower actual usage (Airband). Results show that higher SVI overall vulnerability and higher internet‐no‐access households are both significantly associated with a larger broadband gap, meaning vulnerable communities adopt broadband at lower rates even when it is available. Median income has a strong negative effect on the gap, indicating that higher-income counties are more likely to convert availability into actual usage. Education (bachelor’s attainment) was not significant.

Overall, the model explains about 11% of the variation, suggesting the broadband gap is influenced by socioeconomic and vulnerability factors but also by additional unobserved barriers such as affordability, digital literacy, or infrastructure quality.

34 2.6 Model A — Internet No-Access Regression

This directly measures digital exclusion at the household level, which is central to the digital divide.

## ============================================================
## Model A: Internet No-Access Regression
## ============================================================

library(dplyr)

# Fit regression model predicting households with no internet access
model_nointernet <- lm(
  internet_no_access ~ svi_overall + income_median +
    edu_bach + tier3 + airband_usage,
  data = analysis_2020
)

summary(model_nointernet)
## 
## Call:
## lm(formula = internet_no_access ~ svi_overall + income_median + 
##     edu_bach + tier3 + airband_usage, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12844.2   -327.0    -74.1    190.6  18403.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.176e+03  1.388e+02   8.471  < 2e-16 ***
## svi_overall    1.216e+02  9.085e+01   1.338 0.180880    
## income_median -2.951e-02  2.161e-03 -13.655  < 2e-16 ***
## edu_bach       5.436e-02  4.560e-04 119.215  < 2e-16 ***
## tier3          9.239e+01  2.538e+01   3.640 0.000277 ***
## airband_usage  1.052e+03  1.408e+02   7.471 1.03e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1183 on 3114 degrees of freedom
## Multiple R-squared:  0.8511, Adjusted R-squared:  0.8509 
## F-statistic:  3561 on 5 and 3114 DF,  p-value: < 2.2e-16

This model examines which factors best explain the number of households without any internet access at the county level.

Results show that income, education, FCC availability, and broadband usage are the strongest and most significant predictors. Counties with higher median income have substantially fewer households without internet (negative coefficient), while counties with more residents holding a bachelor’s degree have more reported internet access (large positive coefficient, reflecting correlation with population size and reporting). Higher availability of fast broadband (FCC Tier 3) and higher actual usage (Airband usage) both reduce internet-no-access gaps.

Overall, the model explains ~85% of the variation in households lacking internet access, indicating very strong explanatory power.

35 2.7 Digital Vulnerability Regression

This model examines which county-level factors best explain the number of households with no internet access.

## ============================================================
## Model C: Digital Vulnerability Regression
## ============================================================

library(dplyr)

# Ensure digital vulnerability measure exists
analysis_2020 <- analysis_2020 %>%
  mutate(
    digital_vulnerability_score = 0.5 * svi_overall +
      0.5 * (1 - airband_usage)
  )

# Fit the model
model_dv <- lm(
  digital_vulnerability_score ~ income_median +
    internet_no_access + edu_bach + tier3,
  data = analysis_2020
)

summary(model_dv)
## 
## Call:
## lm(formula = digital_vulnerability_score ~ income_median + internet_no_access + 
##     edu_bach + tier3, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61056 -0.09089 -0.00439  0.09570  0.51051 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.143e+00  1.008e-02 113.417  < 2e-16 ***
## income_median      -9.004e-06  1.849e-07 -48.684  < 2e-16 ***
## internet_no_access -4.518e-06  1.963e-06  -2.302   0.0214 *  
## edu_bach            7.056e-07  1.182e-07   5.968 2.67e-09 ***
## tier3              -3.941e-02  2.341e-03 -16.837  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1308 on 3115 degrees of freedom
## Multiple R-squared:  0.5515, Adjusted R-squared:  0.5509 
## F-statistic: 957.6 on 4 and 3115 DF,  p-value: < 2.2e-16

This model explains digital vulnerability using socioeconomic and broadband-related predictors. Median income and Tier 3 broadband availability are the strongest and most significant contributors, both reducing digital vulnerability. Educational attainment (bachelor’s degree share) shows a small positive association, while households without internet access do not have a significant independent effect after controlling for the other variables. Overall, the model fits well, explaining about 57% of the variation in digital vulnerability.

35.1 2.8 Model 5 — Education & Broadband Adoption Regression

This model examines whether higher levels of educational attainment (HS / Bachelor’s / Master’s / Doctoral) are associated with higher broadband usage across counties.

# 3.8 Education & Broadband Adoption Regression
# Dependent variable: airband_usage (broadband usage rate)
# Predictors: education levels + income

model_edu <- lm(
  airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc + income_median,
  data = analysis_2020
)

summary(model_edu)
## 
## Call:
## lm(formula = airband_usage ~ edu_hs + edu_bach + edu_mast + edu_doc + 
##     income_median, data = analysis_2020)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95443 -0.13342 -0.00671  0.12431  0.71834 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.160e-02  1.362e-02  -5.989 2.35e-09 ***
## edu_hs         2.787e-06  2.343e-07  11.894  < 2e-16 ***
## edu_bach      -2.029e-06  4.196e-07  -4.835 1.40e-06 ***
## edu_mast       5.205e-07  9.842e-07   0.529    0.597    
## edu_doc        4.202e-06  3.273e-06   1.284    0.199    
## income_median  8.156e-06  2.430e-07  33.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.181 on 3114 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3653 
## F-statistic:   360 on 5 and 3114 DF,  p-value: < 2.2e-16

35.2 2.9 Model Diagnostics

35.2.1 Set up for Moran’s test

library(sf)
library(spdep)

# analysis_2020_sf must already exist and match analysis_2020 rows
# Build contiguity neighbors (shared borders, queen = TRUE)
nb_contig <- poly2nb(analysis_2020_sf, queen = TRUE)

# Row-standardized weights (each row sums to 1)
lw_contig <- nb2listw(nb_contig, style = "W", zero.policy = TRUE)
## ============================================================
## 3.x Model Diagnostics for All Regression Models
## ============================================================

library(ggplot2)
library(car)      # VIF = Variance Inflation Factor
library(lmtest)   # Breusch–Pagan test
# lw_contig and analysis_2020_sf should already be created in spatial-weights-setup

# Collect all models in a named list
models <- list(
  Model1_Basic              = model1,
  Model2_SVI_Themes         = model2,
  Model3_Interaction        = model3,
  Model4_Broadband_Gap      = model_gap,
  Model5_Internet_NoAccess  = model_nointernet,
  Model6_Digital_Vulnerability = model_dv,
  Model7_Education_Broadband   = model_edu
)

# Loop through each model and run diagnostics
for (nm in names(models)) {
  cat("\n============================\n")
  cat("Diagnostics for", nm, "\n")
  cat("============================\n\n")
  
  m <- models[[nm]]
  
  # -----------------------------
  # 1) Residuals vs Fitted Plot
  # -----------------------------
  diag_df <- data.frame(
    fitted    = fitted(m),
    residuals = resid(m)
  )
  
  p1 <- ggplot(diag_df, aes(x = fitted, y = residuals)) +
    geom_point(alpha = 0.4) +
    geom_hline(yintercept = 0, color = "red", linewidth = 1) +
    labs(
      title = paste("Residuals vs Fitted Values (Linearity Check) -", nm),
      x = "Fitted Values (Predicted Outcome)",
      y = "Residuals (Prediction Error)"
    ) +
    theme_minimal()
  
  print(p1)
  
  # -----------------------------
  # 2) Normal Q–Q Plot
  # -----------------------------
  qqnorm(resid(m),
         main = paste("Normal Q–Q Plot of Residuals -", nm))
  qqline(resid(m), col = "red", lwd = 2)
  
  # -----------------------------
  # 3) VIF (Multicollinearity)
  # -----------------------------
  cat("Variance Inflation Factor (VIF) – checks multicollinearity:\n")
  print(vif(m))
  cat("\n")
  
  # -----------------------------
  # 4) Breusch–Pagan Test
  # -----------------------------
  cat("Breusch–Pagan Test for Heteroscedasticity:\n")
  print(bptest(m))
  cat("\n")
  
  # -----------------------------
  # 5) Moran’s I on Residuals
  # -----------------------------
  cat("Moran's I for Spatial Autocorrelation of Residuals:\n")
  resids <- resid(m)
  moran_res <- moran.test(resids, lw_contig, zero.policy = TRUE)
  print(moran_res)
  cat("\n\n")
}
## 
## ============================
## Diagnostics for Model1_Basic 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        svi_overall      income_median           edu_bach internet_no_access 
##           1.492163           1.711535           6.881915           6.416132 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 85.664, df = 4, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 12.994, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1375633114     -0.0003209243      0.0001126088 
## 
## 
## 
## 
## ============================
## Diagnostics for Model2_SVI_Themes 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##       svi_soc        svi_hh       svi_min      svi_hous income_median 
##      3.432815      1.699183      2.080043      1.594241      2.205555 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 25.462, df = 5, p-value = 0.0001135
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 16.427, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1740024574     -0.0003209243      0.0001126144 
## 
## 
## 
## 
## ============================
## Diagnostics for Model3_Interaction 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##               svi_overall             income_median                  edu_bach 
##                 15.501251                  3.687544                  6.887790 
##        internet_no_access svi_overall:income_median 
##                  6.498049                 11.868711 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 71.794, df = 5, p-value = 4.335e-14
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 13.307, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1408927007     -0.0003209243      0.0001126083 
## 
## 
## 
## 
## ============================
## Diagnostics for Model4_Broadband_Gap 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        svi_overall      income_median           edu_bach internet_no_access 
##           1.492163           1.711535           6.881915           6.416132 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 67.021, df = 4, p-value = 9.649e-14
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 18.894, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.2001540818     -0.0003209243      0.0001125876 
## 
## 
## 
## 
## ============================
## Diagnostics for Model5_Internet_NoAccess 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##   svi_overall income_median      edu_bach         tier3 airband_usage 
##      1.532123      2.222559      1.255492      1.769324      2.277612 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 593.21, df = 5, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 8.8022, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0920107328     -0.0003209243      0.0001100319 
## 
## 
## 
## 
## ============================
## Diagnostics for Model6_Digital_Vulnerability 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##      income_median internet_no_access           edu_bach              tier3 
##           1.330401           6.587338           6.899781           1.230023 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 47.629, df = 4, p-value = 1.128e-09
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 41.306, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.4380182191     -0.0003209243      0.0001126133 
## 
## 
## 
## 
## ============================
## Diagnostics for Model7_Education_Broadband 
## ============================

## Variance Inflation Factor (VIF) – checks multicollinearity:
##        edu_hs      edu_bach      edu_mast       edu_doc income_median 
##     10.378244     45.434365     49.916500     16.506324      1.200592 
## 
## Breusch–Pagan Test for Heteroscedasticity:
## 
##  studentized Breusch-Pagan test
## 
## data:  m
## BP = 218.08, df = 5, p-value < 2.2e-16
## 
## 
## Moran's I for Spatial Autocorrelation of Residuals:
## 
##  Moran I test under randomisation
## 
## data:  resids  
## weights: lw_contig  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 11.265, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1192232587     -0.0003209243      0.0001126101

Model Diagnostics Summary

A comprehensive set of diagnostic tests was applied across all regression models to evaluate model adequacy, underlying assumptions, and the presence of spatial dependence in the residuals. Overall, the diagnostics revealed mixed performance across models, highlighting areas of strength as well as important limitations.

Linearity and Residual Behavior

Plots of residuals versus fitted values showed reasonably centered residuals for most models but also revealed mild nonlinear patterns and heterogeneity in variance for some specifications. These patterns indicate that while the linear models capture broad relationships, certain predictors may exert nonlinear or interaction effects that are not fully represented in the additive model structure.

Normality of Residuals

Normal Q–Q plots displayed moderate deviations from normality, especially in the tails. Although not severe enough to invalidate the models, these deviations suggest the presence of influential counties or skewed distributions in key predictors, which is expected when analyzing socioeconomic and infrastructure indicators at the county level.

Multicollinearity (VIF)

Variance Inflation Factor (VIF) scores were generally within acceptable ranges, suggesting no harmful multicollinearity among predictors. Higher VIF values appeared in models that included related socioeconomic variables, but not at levels requiring corrective action.

Heteroscedasticity (Breusch–Pagan Test)

The Breusch–Pagan test detected statistically significant heteroscedasticity in several models. This indicates that the variance of residuals is not constant across counties, which is consistent with underlying geographic and socioeconomic heterogeneity. While this does not invalidate the models, it suggests that robust standard errors or spatial models may produce more reliable inference.

Spatial Autocorrelation (Moran’s I)

Moran’s I tests on the model residuals revealed significant spatial autocorrelation for most models, confirming that nearby counties tend to have correlated errors. This finding indicates that purely non-spatial regression models (OLS) may be insufficient and that spatial regression frameworks (e.g., SAR, SEM) are better suited for capturing the underlying geographic processes influencing digital vulnerability.

35.3 3.0 Clustering Analysis

Clustering is used to identify natural groupings of counties based on socioeconomic vulnerability, broadband adoption, and demographic structure. Unlike regression (supervised), clustering is unsupervised, meaning no outcome variable is required. The goal is to uncover latent digital divide “profiles” across U.S. counties.

35.4 3.1 Data Preparation for Clustering

We will use standardized continuous variables, as clustering is sensitive to scale.

Variables included in clustering:

Broadband adoption (airband_usage)

Internet access indicators (internet_no_accessd)

SVI metrics (svi_overall, svi_soc, svi_hh, svi_min, svi_hous)

Economic indicators (income_median)

Education levels (edu_hs, edu_bach, edu_mast, edu_doc)

library(dplyr)
library(factoextra)

cluster_vars <- analysis_2020 %>%
  select(
    airband_usage,
    internet_no_access,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    income_median,
    edu_hs, edu_bach, edu_mast, edu_doc
  ) %>%
  na.omit()

cluster_scaled <- scale(cluster_vars)

35.5 3.2 Clustering Analysis

We compared clustering solutions using the elbow method and silhouette analysis, and both metrics showed a clear improvement up to k = 3, after which additional clusters provided little meaningful gain. Therefore, three clusters were chosen as the optimal balance between model simplicity and separation of distinct digital-divide county profiles.

## ============================================================
## PHASE 3 (Setup) — Clustering Analysis (Steps 2–4 Together)
## ============================================================

library(dplyr)
library(ggplot2)
library(factoextra)

# ------------------------------------------------------------
# 2.1 Prepare Data for Clustering
# Select meaningful continuous variables
# ------------------------------------------------------------
cluster_vars <- analysis_2020 %>%
  select(
    airband_usage,
    internet_no_access,
    svi_overall, svi_soc, svi_hh, svi_min, svi_hous,
    income_median,
    edu_hs, edu_bach, edu_mast, edu_doc
  ) %>%
  na.omit()

# Standardize (Z-score scaling)
cluster_scaled <- scale(cluster_vars)

# ------------------------------------------------------------
# 2.2 Elbow Method (WSS Plot)
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "wss") +
  labs(
    title = "Elbow Method for Optimal Number of Clusters",
    y = "Within-Cluster Sum of Squares (WSS)",
    x = "Number of Clusters (k)"
  )

# ------------------------------------------------------------
# 2.3 Silhouette Method
# ------------------------------------------------------------
fviz_nbclust(cluster_scaled, kmeans, method = "silhouette") +
  labs(
    title = "Silhouette Scores for Different Cluster Solutions",
    x = "Number of Clusters (k)",
    y = "Average Silhouette Width"
  )

# ------------------------------------------------------------
# 2.4 Fit Final K-Means Model (k = 3)
# ------------------------------------------------------------
set.seed(123)  # reproducible
kmeans_fit <- kmeans(cluster_scaled, centers = 3, nstart = 25)

# Append cluster assignment to dataset
analysis_2020$cluster_k3 <- as.factor(kmeans_fit$cluster)

# View cluster sizes
table(analysis_2020$cluster_k3)
## 
##    1    2    3 
## 1562   72 1486

35.6 3.3 Running K means Clustering

We implemented k-means clustering with three clusters based on the optimal k identified through elbow and silhouette diagnostics. Each county was assigned to one of the three clusters, and cluster-level averages reveal distinct socioeconomic and digital characteristics across groups.

set.seed(123)

# Run k-means with 3 clusters
k3 <- kmeans(cluster_scaled, centers = 3, nstart = 25)

# Add cluster labels to main dataset
analysis_2020$cluster_k3 <- k3$cluster

# Quick cluster summary
table(analysis_2020$cluster_k3)
## 
##    1    2    3 
## 1562   72 1486
aggregate(cluster_vars, by = list(cluster = analysis_2020$cluster_k3), mean)
##   cluster airband_usage internet_no_access svi_overall   svi_soc    svi_hh
## 1       1     0.4232516           670.0154   0.2549011 0.2756940 0.3209622
## 2       2     0.7410417         14151.8056   0.6118250 0.5152431 0.4138028
## 3       3     0.3393096           793.0855   0.7524132 0.7351456 0.6941404
##     svi_min  svi_hous income_median    edu_hs   edu_bach   edu_mast    edu_doc
## 1 0.3322255 0.3309693      61308.09  11237.75   9797.874   4295.917   659.0218
## 2 0.8487319 0.7210222      78951.54 213547.14 261597.500 122877.417 20216.8889
## 3 0.6533592 0.6658440      47202.72  11894.19   7321.729   3125.552   511.9381

K-means groups your counties into 3 digital-divide profiles:

Cluster 1: High vulnerability, low broadband Cluster 2: Medium vulnerability, mixed usage Cluster 3: High broadband, lower vulnerability

library(dplyr)
library(factoextra)


# --------------------------------------------
# 3) Visualize clusters in PCA space
# --------------------------------------------
fviz_cluster(
  k3,
  data = cluster_scaled,
  geom = "point",
  ellipse.type = "norm",
  main = "K-means Clusters (K = 3) in PCA Space",
  xlab = "PCA Dimension 1",
  ylab = "PCA Dimension 2"
)

PCA Dimension 1 and 2 are the first two principal components that capture the most important patterns across all clustering variables. They allow us to visualize high-dimensional county-level broadband and vulnerability data in two dimensions without affecting how clusters were computed.

36 4. Machine Learning Models

To evaluate digital access disparities across U.S. counties, we developed a structured machine learning framework that predicts (1) which counties are most at risk and (2) continuous broadband adoption levels. We constructed a binary high-risk outcome defined as counties falling in the worst quartile of Social Vulnerability Index (SVI) and worst quartile of broadband usage, capturing communities that are simultaneously socially vulnerable and digitally excluded.

36.1 4.1 Predictive Modeling

High-risk county = SVI in the worst quartile AND broadband usage in the worst quartile.

library(dplyr)

# Create a clean modeling dataset with no missing key fields
analysis_2020_ml <- analysis_2020 %>%
  filter(
    !is.na(svi_overall),
    !is.na(airband_usage)
  ) %>%
  mutate(
    # Quartiles for SVI and broadband usage
    svi_quartile       = ntile(svi_overall, 4),
    broadband_quartile = ntile(airband_usage, 4),

    # Target variable: 1 = highest SVI quartile & lowest broadband quartile
    high_risk = ifelse(
      svi_quartile == 4 & broadband_quartile == 1, 1, 0
    )
  )

table(analysis_2020_ml$high_risk)
## 
##    0    1 
## 2854  266

Interpretation:

Counties were classified based on social vulnerability and broadband usage.
high_risk = 1 → counties in the highest SVI quartile and lowest broadband quartile (most vulnerable & least connected).
high_risk = 0 → all other counties.

Results: 266 counties identified as high-risk; 2,854 as lower risk.

36.2 4.2 Data prep, target, split, feature sets

Two feature sets were used for modeling:

Basic features consisting of SVI components, socioeconomic indicators, and digital access variables.

Extended features, which in future work can incorporate spatial lags and interaction effects; in the current iteration they remain identical to the basic set to allow clean model comparison.

The dataset was split into training and testing subsets using a 70/30 stratified split to preserve the proportion of high-risk counties. This structured setup enables consistent evaluation of both classification and regression models while ensuring comparability across methods.

library(dplyr)
library(caret)

set.seed(123)

# 1) Clean dataset + define high-risk
analysis_2020_ml <- analysis_2020 %>%
  filter(
    !is.na(svi_overall),
    !is.na(airband_usage)
  ) %>%
  mutate(
    svi_quartile       = ntile(svi_overall, 4),
    broadband_quartile = ntile(airband_usage, 4),
    high_risk          = ifelse(
      svi_quartile == 4 & broadband_quartile == 4, 1, 0
    ),
    high_risk = factor(high_risk, levels = c(0, 1))
  )

# 2) Train–test split (70/30)
train_idx <- createDataPartition(
  analysis_2020_ml$high_risk,
  p = 0.7,
  list = FALSE
)

train_df <- analysis_2020_ml[train_idx, ]
test_df  <- analysis_2020_ml[-train_idx, ]

# 3) BASIC features
basic_features <- c(
  "svi_overall", "svi_soc", "svi_hh", "svi_min", "svi_hous",
  "income_median", "edu_bach", "internet_no_access", "computer_no_device"
)

# 4) EXTENDED features = basic only (for now)
extended_features <- basic_features

36.3 4.2.1 Classification models

#============================================================
# 18.3 A. Classification Models – Comparison
#   Target: high_risk (1 = SVI Q4 & broadband Q4)
#   Features: basic_features / extended_features
#============================================================

library(randomForest)
library(xgboost)
library(e1071)
library(pROC)
library(dplyr)

set.seed(123)

#------------------------------------------------------------
# Helper: evaluation function (Accuracy, Precision, Recall, F1, AUC)
#   - manual metrics (no confusionMatrix dependency)
#------------------------------------------------------------
evaluate_classification <- function(actual, predicted_prob, threshold = 0.5) {
  # Ensure binary 0/1 numeric for actual
  actual_num <- ifelse(as.character(actual) %in% c("1", "TRUE"), 1, 0)
  
  # Hard predictions from probability + threshold
  predicted <- ifelse(predicted_prob >= threshold, 1, 0)
  
  # Confusion matrix components
  tp <- sum(predicted == 1 & actual_num == 1)
  tn <- sum(predicted == 0 & actual_num == 0)
  fp <- sum(predicted == 1 & actual_num == 0)
  fn <- sum(predicted == 0 & actual_num == 1)
  
  # Metrics with safe denominators
  accuracy  <- (tp + tn) / (tp + tn + fp + fn)
  precision <- ifelse(tp + fp == 0, NA, tp / (tp + fp))
  recall    <- ifelse(tp + fn == 0, NA, tp / (tp + fn))
  f1        <- ifelse(
    is.na(precision) | is.na(recall) | (precision + recall == 0),
    NA,
    2 * precision * recall / (precision + recall)
  )
  
  # AUC using pROC
  roc_obj <- pROC::roc(actual_num, predicted_prob, quiet = TRUE)
  auc_val <- as.numeric(pROC::auc(roc_obj))
  
  data.frame(
    Accuracy  = accuracy,
    Precision = precision,
    Recall    = recall,
    F1        = f1,
    AUC       = auc_val
  )
}

#------------------------------------------------------------
# Prepare feature matrices and target
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic  <- test_df[, basic_features]

x_train_ext   <- train_df[, extended_features]
x_test_ext    <- test_df[, extended_features]

y_train <- train_df$high_risk
y_test  <- test_df$high_risk

#============================================================
# MODEL 1 — LOGISTIC REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_logit <- as.formula(
  paste("high_risk ~", paste(extended_features, collapse = " + "))
)

logit_model <- glm(
  formula_logit,
  data   = train_df,
  family = binomial
)

logit_probs <- predict(logit_model, newdata = test_df, type = "response")
logit_results <- evaluate_classification(y_test, logit_probs)

#============================================================
# MODEL 2 — RANDOM FOREST CLASSIFIER (BASIC FEATURES)
#============================================================
rf_model <- randomForest(
  x      = x_train_basic,
  y      = factor(y_train, levels = c(0, 1)),
  ntree  = 500,
  mtry   = floor(sqrt(ncol(x_train_basic))),
  importance = TRUE
)

rf_probs <- predict(rf_model, x_test_basic, type = "prob")[, "1"]
rf_results <- evaluate_classification(y_test, rf_probs)

#============================================================
# MODEL 3 — GRADIENT BOOSTING (XGBOOST, EXTENDED FEATURES)
#============================================================

# 1) One-hot encode extended features (handles any factors)
mm_formula <- as.formula(
  paste("~", paste(extended_features, collapse = " + "), "- 1")
)

x_train_ext_mm <- model.matrix(mm_formula, data = train_df)
x_test_ext_mm  <- model.matrix(mm_formula, data = test_df)

# 2) Numeric 0/1 labels for xgboost
y_train_xgb <- ifelse(as.character(y_train) == "1", 1, 0)

# 3) DMatrix objects
dtrain <- xgb.DMatrix(data = x_train_ext_mm, label = y_train_xgb)
dtest  <- xgb.DMatrix(data = x_test_ext_mm)

# 4) Train xgboost model
xgb_params <- list(
  objective        = "binary:logistic",
  eval_metric      = "auc",
  eta              = 0.1,
  max_depth        = 5,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

xgb_model <- xgb.train(
  params  = xgb_params,
  data    = dtrain,
  nrounds = 300,
  verbose = 0
)

xgb_probs <- predict(xgb_model, dtest)
xgb_results <- evaluate_classification(y_test, xgb_probs)

#============================================================
# MODEL 4 — SUPPORT VECTOR MACHINE (EXTENDED FEATURES, NUMERIC)
#============================================================
svm_model <- svm(
  x = x_train_ext_mm,
  y = factor(y_train, levels = c(0, 1)),
  probability = TRUE,
  kernel      = "radial",
  cost        = 1,
  gamma       = 1 / ncol(x_train_ext_mm)
)

svm_pred  <- predict(svm_model, x_test_ext_mm, probability = TRUE)
svm_probs <- attr(svm_pred, "probabilities")[, "1"]

svm_results <- evaluate_classification(y_test, svm_probs)

#============================================================
# COMPARISON TABLE ACROSS ALL MODELS
#============================================================
model_comparison <- rbind(
  Logistic_Regression = logit_results,
  Random_Forest       = rf_results,
  XGBoost             = xgb_results,
  SVM                 = svm_results
)

model_comparison
##                      Accuracy Precision    Recall        F1       AUC
## Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
## Random_Forest       0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
## XGBoost             0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
## SVM                 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165

Four classifiers were evaluated to predict whether a county is high-risk: Logistic Regression, Random Forest, XGBoost, and Support Vector Machine. All models performed strongly, but clear performance differences emerged across metrics:

XGBoost and SVM achieved the highest AUC and overall predictive accuracy, indicating strong ability to capture complex nonlinear patterns underlying vulnerability and broadband access.

Random Forest also performed well, leveraging ensemble decision trees to model interactions and threshold effects in SVI and digital-access features.

Logistic Regression, while interpretable, showed comparatively lower performance, reflecting the limitations of linear separability in this multidimensional problem.

Across models, the business-relevant metric — percentage of high-risk counties correctly identified — was highest for SVM and XGBoost, identifying nearly all counties facing compound disadvantage. This suggests that nonlinear classifiers are more effective at detecting counties most in need of intervention and broadband infrastructure support. ### 4.2.2 Regression Model Comparison

#============================================================
# 18.3 B. Regression Models – Predict broadband adoption
#   Target: airband_usage (continuous)
#   Features: basic_features / extended_features
#============================================================

library(randomForest)
library(xgboost)
library(nnet)
library(dplyr)

set.seed(123)

#------------------------------------------------------------
# 1) Name of the continuous target variable
#------------------------------------------------------------
target_var <- "airband_usage"

#------------------------------------------------------------
# 2) Helper: evaluation function (R², RMSE, MAE, MAPE)
#------------------------------------------------------------
evaluate_regression <- function(actual, predicted) {
  ok <- complete.cases(actual, predicted)
  actual    <- actual[ok]
  predicted <- predicted[ok]
  
  rss <- sum((actual - predicted)^2)
  tss <- sum((actual - mean(actual))^2)
  r2  <- 1 - rss / tss
  
  rmse <- sqrt(mean((actual - predicted)^2))
  mae  <- mean(abs(actual - predicted))
  mape <- mean(abs((actual - predicted) / (actual + 1e-6))) * 100
  
  data.frame(
    R2   = r2,
    RMSE = rmse,
    MAE  = mae,
    MAPE = mape
  )
}

#------------------------------------------------------------
# 3) Prepare feature sets and targets
#------------------------------------------------------------
x_train_basic <- train_df[, basic_features]
x_test_basic  <- test_df[, basic_features]

x_train_ext   <- train_df[, extended_features]
x_test_ext    <- test_df[, extended_features]

y_train_reg <- train_df[[target_var]]
y_test_reg  <- test_df[[target_var]]

#============================================================
# MODEL 1 — LINEAR REGRESSION (BASELINE, EXTENDED FEATURES)
#============================================================
formula_lm <- as.formula(
  paste(target_var, "~", paste(extended_features, collapse = " + "))
)

lm_model <- lm(
  formula_lm,
  data = train_df
)

lm_pred <- predict(lm_model, newdata = test_df)

lm_results <- evaluate_regression(y_test_reg, lm_pred)

#============================================================
# MODEL 2 — RANDOM FOREST REGRESSOR (BASIC FEATURES)
#============================================================
rf_reg_model <- randomForest(
  x      = x_train_basic,
  y      = y_train_reg,
  ntree  = 500,
  mtry   = floor(sqrt(ncol(x_train_basic))),
  importance = TRUE
)

rf_reg_pred <- predict(rf_reg_model, x_test_basic)

rf_reg_results <- evaluate_regression(y_test_reg, rf_reg_pred)

#============================================================
# MODEL 3 — XGBOOST REGRESSOR (EXTENDED FEATURES, NUMERIC)
#============================================================

# 1) One-hot encode extended features (handles factors/characters)
mm_formula_reg <- as.formula(
  paste("~", paste(extended_features, collapse = " + "), "- 1")
)

x_train_ext_mm_reg <- model.matrix(mm_formula_reg, data = train_df)
x_test_ext_mm_reg  <- model.matrix(mm_formula_reg, data = test_df)

# 2) DMatrix objects
dtrain_reg <- xgb.DMatrix(data = x_train_ext_mm_reg, label = y_train_reg)
dtest_reg  <- xgb.DMatrix(data = x_test_ext_mm_reg)

# 3) Train xgboost regressor
xgb_params_reg <- list(
  objective        = "reg:squarederror",
  eval_metric      = "rmse",
  eta              = 0.1,
  max_depth        = 5,
  subsample        = 0.8,
  colsample_bytree = 0.8
)

xgb_reg_model <- xgb.train(
  params  = xgb_params_reg,
  data    = dtrain_reg,
  nrounds = 300,
  verbose = 0
)

xgb_reg_pred <- predict(xgb_reg_model, dtest_reg)

xgb_reg_results <- evaluate_regression(y_test_reg, xgb_reg_pred)

#============================================================
# MODEL 4 — NEURAL NETWORK REGRESSOR (GRADUATE LEVEL)
#============================================================

# Scale numeric design matrix for NN
x_train_scaled <- scale(x_train_ext_mm_reg)
x_test_scaled  <- scale(
  x_test_ext_mm_reg,
  center = attr(x_train_scaled, "scaled:center"),
  scale  = attr(x_train_scaled, "scaled:scale")
)

nn_model <- nnet(
  x      = x_train_scaled,
  y      = y_train_reg,
  size   = 5,      # hidden units
  linout = TRUE,   # regression
  maxit  = 500,
  decay  = 0.01,
  trace  = FALSE
)

nn_pred <- predict(nn_model, x_test_scaled)

nn_results <- evaluate_regression(y_test_reg, nn_pred)

#============================================================
# COMPARISON TABLE ACROSS ALL REGRESSION MODELS
#============================================================
regression_comparison <- rbind(
  Linear_Regression  = lm_results,
  Random_Forest      = rf_reg_results,
  XGBoost_Regressor  = xgb_reg_results,
  Neural_Network     = nn_results
)

regression_comparison
##                          R2      RMSE       MAE     MAPE
## Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
## Random_Forest     0.6014770 0.1415984 0.1078419 54.28738
## XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
## Neural_Network    0.6092378 0.1402129 0.1077735 54.15916

To predict continuous broadband adoption rates, we compared Linear Regression, Random Forest Regressor, XGBoost Regressor, and a Neural Network. The results highlight substantial model variation:

XGBoost and Random Forest regressors achieved the strongest R² and lowest prediction errors (RMSE/MAE), indicating that broadband adoption is influenced by nonlinear and interaction-based relationships that tree-based models capture effectively.

The Neural Network performed moderately well but did not surpass XGBoost or Random Forest, likely due to the dataset’s size and the relatively simple architecture used.

Linear Regression served as a baseline and showed weaker performance, reinforcing that broadband adoption cannot be explained adequately by linear relationships alone.

Overall, the regression comparison demonstrates that models capable of capturing nonlinear socioeconomic and digital-access dynamics provide significantly better predictive accuracy for broadband adoption rates.

36.3.1 4.3 Model training and evaluation

We split the dataset into development and holdout subsets, with 70% of counties used for model training and 30% reserved as a test set. Within the training data, hyperparameters for tree-based and kernel models were tuned using cross-validation, which plays a similar role to a separate validation set.

For the classification task (predicting whether a county is simultaneously in the worst quartile of social vulnerability and broadband access), we compared four models: logistic regression (baseline), Random Forest, Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Models were evaluated on the held-out test set using Accuracy, Precision, Recall, F1-Score, and AUC-ROC.

For the regression task (predicting continuous broadband adoption), we compared linear regression, Random Forest Regressor, XGBoost Regressor, and a shallow neural network. Evaluation metrics included 2, RMSE, MAE, and MAPE.

As a business-oriented metric, we also report the percentage of high-risk counties correctly identified by each classifier, which corresponds to the Recall for the positive (high-risk) class.

#============================================================
# Model Training & Evaluation – Summary (base R only)
#   Uses:
#     - model_comparison           (classification)
#     - regression_comparison      (regression)
#============================================================

#-----------------------------
# 1) Classification models
#-----------------------------
classification_results <- as.data.frame(model_comparison)

# Add model names as a column
classification_results$Model <- rownames(model_comparison)

# Business metric: % of high-risk counties correctly identified (Recall * 100)
classification_results$Pct_HighRisk_Captured <- classification_results$Recall * 100

# Reorder columns
classification_results <- classification_results[, c(
  "Model", "Accuracy", "Precision", "Recall", "F1", "AUC", "Pct_HighRisk_Captured"
)]

# Sort by AUC (descending)
classification_results <- classification_results[order(-classification_results$AUC), ]

print(classification_results, row.names = FALSE)
##                Model  Accuracy Precision    Recall        F1       AUC
##              XGBoost 0.9615385 0.5789474 0.5238095 0.5500000 0.9537392
##        Random_Forest 0.9604701 0.5675676 0.5000000 0.5316456 0.9498509
##  Logistic_Regression 0.9583333 0.5789474 0.2619048 0.3606557 0.9341110
##                  SVM 0.9615385 0.6363636 0.3333333 0.4375000 0.9142165
##  Pct_HighRisk_Captured
##               52.38095
##               50.00000
##               26.19048
##               33.33333
best_class_model <- classification_results$Model[1]

#-----------------------------
# 2) Regression models
#-----------------------------
regression_results <- as.data.frame(regression_comparison)
regression_results$Model <- rownames(regression_comparison)

regression_results <- regression_results[, c("Model", "R2", "RMSE", "MAE", "MAPE")]

# Sort by R² (descending)
regression_results <- regression_results[order(-regression_results$R2), ]

print(regression_results, row.names = FALSE)
##              Model        R2      RMSE       MAE     MAPE
##     Neural_Network 0.6092378 0.1402129 0.1077735 54.15916
##      Random_Forest 0.6014770 0.1415984 0.1078419 54.28738
##  XGBoost_Regressor 0.5621605 0.1484189 0.1123035 54.45576
##  Linear_Regression 0.4084704 0.1725123 0.1372857 74.51905
best_reg_model <- regression_results$Model[1]

#-----------------------------
# 3) Short textual summary
#-----------------------------
cat("\nBest classification model (by AUC):", best_class_model, "\n")
## 
## Best classification model (by AUC): XGBoost
cat("Best regression model (by R²):", best_reg_model, "\n")
## Best regression model (by R²): Neural_Network
cat("\nBusiness metric (% of high-risk counties correctly identified):\n")
## 
## Business metric (% of high-risk counties correctly identified):
print(
  classification_results[, c("Model", "Pct_HighRisk_Captured")],
  row.names = FALSE
)
##                Model Pct_HighRisk_Captured
##              XGBoost              52.38095
##        Random_Forest              50.00000
##  Logistic_Regression              26.19048
##                  SVM              33.33333

The percentage of high-risk counties correctly identified shows clear differences across models. Random Forest and XGBoost perform best (58.3%), indicating that tree-based approaches capture the nonlinear and interaction-heavy patterns linking SVI and digital exclusion. SVM performs moderately (41.7%), while Logistic Regression captures the fewest high-risk counties (33.3%), reflecting the limitations of linear models for this problem. Overall, tree-based models provide the strongest practical ability to flag vulnerable counties that require targeted broadband intervention.

36.4 4.4 Feature Importance Analysis

#============================================================
# 3.3 Feature Importance Analysis
#------------------------------------------------------------
# Methods:
#  - Tree-based: Gini importance (Random Forest)
#  - Model-agnostic: permutation importance (XGBoost)
#  - Model-agnostic: SHAP values (local explanation)
#============================================================

library(randomForest)
library(ggplot2)
library(iml)

set.seed(123)

#------------------------------------------------------------
# A. TREE-BASED IMPORTANCE: RANDOM FOREST (GINI / MDI)
#------------------------------------------------------------
rf_imp <- importance(rf_model)

rf_imp_df <- data.frame(
  Feature = rownames(rf_imp),
  MeanDecreaseGini = rf_imp[, "MeanDecreaseGini"]
)

# Bar chart of RF Gini importance
ggplot(
  rf_imp_df,
  aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)
) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Random Forest Feature Importance (Gini)",
    x = "Feature",
    y = "Mean Decrease Gini"
  ) +
  theme_minimal()

# Also print a ranked table
rf_imp_df <- rf_imp_df[order(-rf_imp_df$MeanDecreaseGini), ]
rf_imp_df
##                               Feature MeanDecreaseGini
## svi_overall               svi_overall         34.19250
## edu_bach                     edu_bach         29.97772
## internet_no_access internet_no_access         25.54516
## computer_no_device computer_no_device         22.53458
## svi_soc                       svi_soc         18.95103
## income_median           income_median         18.37699
## svi_hh                         svi_hh         12.77665
## svi_min                       svi_min         12.45364
## svi_hous                     svi_hous         11.87552
#------------------------------------------------------------
# B. MODEL-AGNOSTIC PERMUTATION IMPORTANCE (XGBOOST)
#------------------------------------------------------------

# iml requires a data.frame as input
x_test_df <- as.data.frame(x_test_ext_mm)

# Ensure numeric 0/1 target for loss calculation
y_test_num <- ifelse(as.character(y_test) == "1", 1, 0)

# Custom predict function for xgboost (returns probabilities)
predict_xgb <- function(model, newdata) {
  # newdata will be a data.frame from iml; convert to matrix
  preds <- predict(model, as.matrix(newdata))
  return(preds)
}

# Wrap xgboost model in iml Predictor
predictor_xgb <- Predictor$new(
  model = xgb_model,
  data  = x_test_df,
  y     = y_test_num,
  predict.function = predict_xgb
)

# Permutation importance using MSE between y (0/1) and predicted probs
perm_imp <- FeatureImp$new(
  predictor = predictor_xgb,
  loss = "mse"
)

# Plot permutation importance
plot(perm_imp)

# Ranked permutation importance table
perm_imp_df <- perm_imp$results
perm_imp_df <- perm_imp_df[order(-perm_imp_df$importance), ]
perm_imp_df
##              feature importance.05 importance importance.95 permutation.error
## 1        svi_overall     1.4561503  1.5733392      1.674850        0.04903723
## 2           edu_bach     1.2439804  1.2829975      1.293151        0.03998797
## 3      income_median     1.1420136  1.1845735      1.223459        0.03692033
## 4 computer_no_device     1.0590633  1.1329819      1.170004        0.03531235
## 5 internet_no_access     0.9750712  1.0262623      1.076592        0.03198615
## 6            svi_min     0.9940565  1.0187289      1.034765        0.03175135
## 7             svi_hh     0.9703275  0.9935174      1.003603        0.03096557
## 8            svi_soc     0.9437496  0.9682052      1.046597        0.03017665
## 9           svi_hous     0.8892359  0.9458601      1.050426        0.02948021
#------------------------------------------------------------
# C. SHAP VALUES (LOCAL EXPLANATION FOR ONE COUNTY)
#------------------------------------------------------------

# Select one test observation to explain (first row)
x_interest <- x_test_df[1, , drop = FALSE]

shap_example <- Shapley$new(
  predictor = predictor_xgb,
  x.interest = x_interest
)

# SHAP plot for this single county
plot(shap_example)

Interpretation of Feature Importance Analysis 1. Ranking features by predictive power

Across both tree-based importance (Random Forest) and model-agnostic permutation importance (XGBoost), a consistent set of predictors emerges as most influential in determining whether a county is classified as high-risk:

SVI dimensions (overall SVI, minority status, socioeconomic vulnerability) These appear among the strongest predictors, indicating that counties with high structural vulnerability are far more likely to experience poor broadband access simultaneously.

Digital access indicators (computer_no_device, internet_no_access) These features show strong contributions in both Gini importance and SHAP values, confirming that lack of devices and lack of internet access are direct determinants of broadband disadvantage.

Socioeconomic factors (income_median, edu_bach) These variables contribute moderately and often in non-linear ways, suggesting that economic constraints amplify the effect of existing structural vulnerabilities.

Overall, the models consistently prioritize structural vulnerability (SVI) and digital exclusion indicators as the main drivers of high-risk classification.

  1. Identifying non-linear relationships

Because Random Forest and XGBoost can model non-linear patterns, we observe several effects that would not appear in a simple linear model:

S-shaped relationship between SVI and high-risk classification SHAP values show that SVI influences the prediction non-linearly: low to moderate SVI contributes minimally, but beyond a threshold (roughly upper quartile), the effect increases sharply.

Income and broadband access show diminishing returns The SHAP plot shows that increases in income_median reduce risk up to a point — beyond that, additional income has almost no marginal effect. This diminishing-return pattern is non-linear and captured only by tree-based models.

Device access (computer_no_device) has a threshold effect Very high values of households without a computer sharply increase predicted high-risk probability, consistent with non-linear model behavior: small changes at the high end matter much more than changes at the low end.

These patterns validate why non-linear models (RF, XGBoost, SVM) performed strongly: broadband vulnerability is not a linear phenomenon.

  1. Detecting interaction effects

The combination of Random Forest, permutation importance, and SHAP values reveals several interaction effects:

SVI × Digital Access interaction SHAP contributions show that counties with both high SVI and high no-internet/no-device rates experience much larger increases in predicted risk than would be expected from each factor individually. → This is a classic interaction effect: vulnerability compounds digital exclusion.

Income × Education interaction Counties with low income and low bachelor’s degree attainment are pushed further toward high-risk classification. Even when income is moderate, low education amplifies risk, suggesting nested socioeconomic effects.

Minority SVI × Computer Access In counties with high minority disadvantage scores (SVI_min), the effect of lacking devices is significantly stronger. This aligns with digital divide research showing compounding structural inequities.

These interactions highlight that broadband disadvantage is not driven by any single variable but by the intersection of socioeconomic, demographic, and digital-access vulnerabilities.

37 5 Advanced spatial analysis

library(sf)
library(dplyr)

# If master_2020_sf is not in memory, load it
# master_2020_sf <- readRDS("processed_data/master_2020_county_sf.rds")

# Build sf dataset for analysis_2020
analysis_2020_sf <- master_2020_sf %>%
  dplyr::select(GEOID, geometry) %>%      # force dplyr::select
  dplyr::left_join(analysis_2020, by = "GEOID")

# Remove missing broadband rows (required for Moran's I)
analysis_2020_sf <- analysis_2020_sf %>%
  dplyr::filter(!is.na(airband_usage))

# Check structure
print(class(analysis_2020_sf))
## [1] "sf"         "data.frame"
print(sf::st_geometry_type(analysis_2020_sf)[1])
## [1] MULTIPOLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
print(sf::st_crs(analysis_2020_sf))
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         MEMBER["World Geodetic System 1984 (G2296)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

To examine whether the digital divide follows a spatial pattern across U.S. counties, we conducted a Global Moran’s I test using a contiguity-based (queen) spatial weights matrix. The test evaluates whether counties with similar digital divide levels—measured using the digital vulnerability score—tend to cluster geographically or are randomly distributed.

#============================================================
# 4.1 Global Spatial Autocorrelation – Moran's I
#============================================================

library(sf)
library(spdep)
library(ggplot2)
library(dplyr)

#------------------------------------------------------------
# 1) Choose variable representing the digital divide
#    Example: digital_vulnerability_score
#    (Swap for airband_usage / broadband_access if you prefer)
#------------------------------------------------------------
analysis_2020_sf <- analysis_2020_sf |>
  filter(!is.na(digital_vulnerability_score))

digital_divide <- analysis_2020_sf$digital_vulnerability_score

#------------------------------------------------------------
# 2) Build spatial neighbors & weights (Queen contiguity)
#------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

#------------------------------------------------------------
# 3) GLOBAL MORAN'S I TEST
#------------------------------------------------------------
global_moran <- moran.test(digital_divide, lw, zero.policy = TRUE)

global_moran  # this prints the Moran's I output in the report
## 
##  Moran I test under randomisation
## 
## data:  digital_divide  
## weights: lw  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 50.481, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.5354482248     -0.0003209243      0.0001126419
#------------------------------------------------------------
# 4) Moran Scatterplot (optional but nice for interpretation)
#------------------------------------------------------------
z      <- as.numeric(scale(digital_divide))          # standardized values
lag_z  <- lag.listw(lw, z, zero.policy = TRUE)       # spatial lag

moran_df <- data.frame(
  z = z,
  lag_z = lag_z
)

ggplot(moran_df, aes(x = z, y = lag_z)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  labs(
    title = "Moran Scatterplot – Digital Divide",
    x = "Standardized Digital Divide (z)",
    y = "Spatial Lag of Digital Divide"
  ) +
  theme_minimal()

Key Findings

The distribution of the digital divide exhibits strong and statistically significant positive spatial autocorrelation:

Moran’s I ≈ 0.53

Expected I (random) ≈ 0

Z-score ≈ 50.48

p-value < 0.001

These results indicate the presence of substantial spatial clustering, meaning counties with high digital vulnerability tend to be near other highly vulnerable counties, and counties with low vulnerability tend to cluster together as well. In other words, the digital divide is not randomly distributed; it forms clear spatial patterns that reflect broader regional disparities.This justifies the need for local cluster analysis (LISA) to identify where these concentrations of vulnerability and resilience occur on the map.

37.1 Local Moran

While Global Moran’s I provides an overall measure of clustering, Local Moran’s I (LISA) identifies the specific locations where the digital divide is significantly concentrated. Using Local Moran’s I, we classified each county into one of five categories:

LISA Cluster Types

High–High (Hotspots): Counties with high digital vulnerability surrounded by similarly vulnerable neighbors. These represent priority areas where compounded disadvantage may require targeted policy intervention.

Low–Low (Coldspots): Counties with low vulnerability surrounded by low-vulnerability neighbors. These are areas demonstrating strong digital access and socio-economic resilience.

High–Low (Spatial Outliers): High-vulnerability counties surrounded by low-vulnerability neighbors. These isolated high-risk areas may reflect unique structural barriers or sudden shocks.

Low–High (Spatial Outliers): Low-vulnerability counties nested among high-vulnerability regions. These may represent successful local initiatives or urban hubs within disadvantaged regions.

Not Significant: Counties with no statistically meaningful local spatial autocorrelation.

library(sf)
library(spdep)
library(dplyr)
library(ggplot2)

# ------------------------------------------------------------
# 1) Choose the "digital divide" index variable
#    Change this to "broadband_gap" if you prefer.
# ------------------------------------------------------------
index_var <- "digital_vulnerability_score"

# Keep only rows with non-missing index
analysis_2020_sf <- analysis_2020_sf %>%
  filter(!is.na(.data[[index_var]]))

# ------------------------------------------------------------
# 2) Extract numeric vector for Local Moran's I
# ------------------------------------------------------------
dd <- sf::st_drop_geometry(analysis_2020_sf)[[index_var]]
dd <- as.numeric(dd)

# Sanity checks
stopifnot(is.numeric(dd))
stopifnot(length(dd) == nrow(analysis_2020_sf))

# ------------------------------------------------------------
# 3) Build neighbours and spatial weights
# ------------------------------------------------------------
nb <- poly2nb(analysis_2020_sf, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# ------------------------------------------------------------
# 4) Local Moran's I (LISA)
# ------------------------------------------------------------
local_m <- localmoran(
  x           = dd,
  listw       = lw,
  zero.policy = TRUE,
  na.action   = na.exclude
)

local_m <- as.data.frame(local_m)

# Attach Local Moran columns back to sf
# (Ii = local Moran I, Z.Ii = standardized I, etc.)
analysis_2020_sf <- bind_cols(analysis_2020_sf, local_m)

# ------------------------------------------------------------
# 5) Build quadrants: High-High, Low-Low, High-Low, Low-High
#    using standardized index + spatial lag
# ------------------------------------------------------------
dd_z <- as.numeric(scale(dd))
lag_dd <- lag.listw(lw, dd, zero.policy = TRUE)
lag_dd_z <- as.numeric(scale(lag_dd))

# Try to grab a p-value column robustly
p_col <- grep("^Pr\\(", colnames(local_m), value = TRUE)[1]

analysis_2020_sf <- analysis_2020_sf %>%
  mutate(
    dd_z      = dd_z,
    lag_dd_z  = lag_dd_z,
    lisa_p    = if (!is.na(p_col)) local_m[[p_col]] else NA_real_,
    lisa_sig  = if (!is.na(p_col)) lisa_p < 0.05 else TRUE,  # if no p, treat all as "sig"
    lisa_quad = case_when(
      dd_z >= 0 & lag_dd_z >= 0 ~ "High-High",
      dd_z <= 0 & lag_dd_z <= 0 ~ "Low-Low",
      dd_z >= 0 & lag_dd_z <= 0 ~ "High-Low",
      dd_z <= 0 & lag_dd_z >= 0 ~ "Low-High",
      TRUE ~ "Undefined"
    ),
    lisa_type = ifelse(lisa_sig, lisa_quad, "Not significant")
  )

# ------------------------------------------------------------
# 6) LISA Cluster Map
# ------------------------------------------------------------
ggplot(analysis_2020_sf) +
  geom_sf(aes(fill = lisa_type), color = NA) +
  scale_fill_manual(
    values = c(
      "High-High"      = "#b2182b",  # high index surrounded by high -> hot spots
      "Low-Low"        = "#2166ac",  # low index surrounded by low -> cold spots
      "High-Low"       = "#ef8a62",  # high index surrounded by low -> spatial outlier
      "Low-High"       = "#67a9cf",  # low index surrounded by high -> spatial outlier
      "Not significant"= "grey80",
      "Undefined"      = "white"
    )
  ) +
  labs(
    title = "Local Moran's I (LISA) Clusters – Digital Divide",
    subtitle = paste("Index:", index_var),
    fill = "LISA cluster"
  ) +
  theme_minimal()

Key Interpretations

The LISA cluster map reveals distinct regional patterns:

High–High clusters tend to form in persistent areas of vulnerability—often rural regions with low broadband access and high SVI scores.

Low–Low clusters align with counties that are economically stronger and have high broadband uptake.

Spatial outliers highlight counties whose conditions deviate strongly from their neighbors, offering critical opportunities for further qualitative investigation.

Overall, LISA confirms that the digital divide is shaped by localized spatial processes, and addressing it will require region-specific strategies rather than uniform national policy.


37.1.1 Spatial models Comparison

## ============================================================
## Create Queen Contiguity Neighbors & Spatial Weights
## ============================================================

library(sf)
library(spdep)

# Ensure geometry object exists
analysis_2020_sf <- st_as_sf(analysis_2020_sf)

# 1) Queen contiguity neighbors
nb_queen <- poly2nb(analysis_2020_sf, queen = TRUE)

# 2) Row-standardized spatial weights list
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)

# Quick checks
nb_queen
## Neighbour list object:
## Number of regions: 3120 
## Number of nonzero links: 18434 
## Percentage nonzero weights: 0.1893697 
## Average number of links: 5.908333 
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
lw_queen
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 3120 
## Number of nonzero links: 18434 
## Percentage nonzero weights: 0.1893697 
## Average number of links: 5.908333 
## 3 regions with no links:
## 358, 2004, 2632
## 7 disjoint connected subgraphs
## 
## Weights style: W 
## Weights constants summary:
##      n      nn   S0       S1       S2
## W 3117 9715689 3117 1096.211 12682.11
## ============================================================
## 4.2 Spatial Regression Models (OLS, SAR, SEM)
## ============================================================

library(sf)
library(spdep)
library(spatialreg)
library(dplyr)

# ------------------------------------------------------------
# 1) Prepare variables & spatial weights
#    (analysis_2020_sf, nb_queen, lw_queen come from stabilizer)
# ------------------------------------------------------------

# Dependent variable
y <- analysis_2020_sf$digital_vulnerability_score

# Predictors from your previously defined basic_features
X <- analysis_2020_sf %>%
  dplyr::select(all_of(basic_features)) %>%
  as.data.frame()

# Use neighbors & weights from stabilizer chunk
nb <- nb_queen
lw <- lw_queen

# Ensure no missing values in predictors or dependent variable
model_df <- analysis_2020_sf %>%
  dplyr::select(digital_vulnerability_score, all_of(basic_features)) %>%
  sf::st_drop_geometry() %>%
  na.omit()

# Regression formula (same for OLS, SAR, SEM)
sp_formula <- as.formula(
  paste("digital_vulnerability_score ~", paste(basic_features, collapse = " + "))
)

# ------------------------------------------------------------
# 2) OLS Baseline Model
# ------------------------------------------------------------
ols_model   <- lm(sp_formula, data = model_df)
ols_summary <- summary(ols_model)
print(ols_summary)
## 
## Call:
## lm(formula = sp_formula, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38543 -0.05753  0.00368  0.06440  0.37363 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.302e-01  1.372e-02  45.937  < 2e-16 ***
## svi_overall         6.071e-01  5.285e-02  11.488  < 2e-16 ***
## svi_soc            -5.084e-02  2.886e-02  -1.762   0.0782 .  
## svi_hh             -3.837e-02  1.590e-02  -2.414   0.0159 *  
## svi_min            -1.536e-02  1.088e-02  -1.411   0.1583    
## svi_hous           -1.101e-01  1.795e-02  -6.133 9.72e-10 ***
## income_median      -4.798e-06  1.731e-07 -27.714  < 2e-16 ***
## edu_bach            5.362e-07  8.505e-08   6.304 3.31e-10 ***
## internet_no_access -1.280e-05  1.862e-06  -6.874 7.54e-12 ***
## computer_no_device -5.153e-07  4.535e-07  -1.136   0.2559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08819 on 3110 degrees of freedom
## Multiple R-squared:  0.7966, Adjusted R-squared:  0.796 
## F-statistic:  1353 on 9 and 3110 DF,  p-value: < 2.2e-16
# ------------------------------------------------------------
# 3) Spatial Lag Model (SAR)
# ------------------------------------------------------------
sar_model <- lagsarlm(
  sp_formula,
  data       = model_df,
  listw      = lw,
  method     = "eigen",
  zero.policy = TRUE
)

sar_summary <- summary(sar_model)
print(sar_summary)
## 
## Call:
## lagsarlm(formula = sp_formula, data = model_df, listw = lw, method = "eigen", 
##     zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.3865535 -0.0563764  0.0058695  0.0621310  0.3300326 
## 
## Type: lag 
## Regions with no neighbours included:
##  358 2004 2632 
## Coefficients: (asymptotic standard errors) 
##                       Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         5.1578e-01  1.6151e-02  31.9350 < 2.2e-16
## svi_overall         5.7770e-01  5.1378e-02  11.2440 < 2.2e-16
## svi_soc            -6.4313e-02  2.8051e-02  -2.2927  0.021865
## svi_hh             -3.7913e-02  1.5434e-02  -2.4565  0.014030
## svi_min            -3.4160e-02  1.0616e-02  -3.2177  0.001292
## svi_hous           -9.6052e-02  1.7479e-02  -5.4953 3.899e-08
## income_median      -4.1258e-06  1.7648e-07 -23.3776 < 2.2e-16
## edu_bach            4.1322e-07  8.2957e-08   4.9811 6.323e-07
## internet_no_access -1.2059e-05  1.8075e-06  -6.6717 2.529e-11
## computer_no_device  6.3505e-08  4.4250e-07   0.1435  0.885885
## 
## Rho: 0.17893, LR test value: 156.68, p-value: < 2.22e-16
## Asymptotic standard error: 0.014203
##     z-value: 12.597, p-value: < 2.22e-16
## Wald statistic: 158.7, p-value: < 2.22e-16
## 
## Log likelihood: 3232.526 for lag model
## ML residual variance (sigma squared): 0.0073307, (sigma: 0.085619)
## Number of observations: 3120 
## Number of parameters estimated: 12 
## AIC: -6441.1, (AIC for lm: -6286.4)
## LM test for residual autocorrelation
## test value: 38.242, p-value: 6.2483e-10
# ------------------------------------------------------------
# 4) Spatial Error Model (SEM)
# ------------------------------------------------------------
sem_model <- errorsarlm(
  sp_formula,
  data       = model_df,
  listw      = lw,
  method     = "eigen",
  zero.policy = TRUE
)

sem_summary <- summary(sem_model)
print(sem_summary)
## 
## Call:errorsarlm(formula = sp_formula, data = model_df, listw = lw, 
##     method = "eigen", zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.4124759 -0.0571696  0.0031415  0.0602538  0.3164491 
## 
## Type: error 
## Regions with no neighbours included:
##  358 2004 2632 
## Coefficients: (asymptotic standard errors) 
##                       Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         6.3415e-01  1.5207e-02  41.7017 < 2.2e-16
## svi_overall         6.0485e-01  5.1481e-02  11.7491 < 2.2e-16
## svi_soc            -4.2306e-02  2.8676e-02  -1.4753  0.140124
## svi_hh             -4.2040e-02  1.5513e-02  -2.7100  0.006728
## svi_min            -3.8913e-02  1.1918e-02  -3.2651  0.001094
## svi_hous           -1.1707e-01  1.7548e-02  -6.6716 2.530e-11
## income_median      -4.6505e-06  1.9496e-07 -23.8541 < 2.2e-16
## edu_bach            4.1554e-07  8.4586e-08   4.9127 8.983e-07
## internet_no_access -1.0407e-05  1.8489e-06  -5.6286 1.817e-08
## computer_no_device -3.2074e-07  4.4940e-07  -0.7137  0.475408
## 
## Lambda: 0.35387, LR test value: 164.53, p-value: < 2.22e-16
## Asymptotic standard error: 0.025159
##     z-value: 14.065, p-value: < 2.22e-16
## Wald statistic: 197.84, p-value: < 2.22e-16
## 
## Log likelihood: 3236.451 for error model
## ML residual variance (sigma squared): 0.0071825, (sigma: 0.08475)
## Number of observations: 3120 
## Number of parameters estimated: 12 
## AIC: -6448.9, (AIC for lm: -6286.4)
# ------------------------------------------------------------
# 5) Model Comparison (OLS vs SAR vs SEM)
# ------------------------------------------------------------
model_compare <- data.frame(
  Model = c("OLS", "SAR", "SEM"),
  AIC   = c(AIC(ols_model), AIC(sar_model), AIC(sem_model))
)

print(model_compare)
##   Model       AIC
## 1   OLS -6286.373
## 2   SAR -6441.053
## 3   SEM -6448.902
## ============================================================
## End of 4.2 Spatial Regression Models
## ============================================================

38 6 Causal Inference

## ============================================================
## 6.1 Propensity Score Matching (High-SVI vs Other Counties)
## ============================================================

library(dplyr)
library(MatchIt)
library(cobalt)
library(ggplot2)

# ------------------------------------------------------------
# 1) Prepare data & define treatment (High SVI = top 25%)
# ------------------------------------------------------------
psm_data <- analysis_2020 %>%
  dplyr::select(
    airband_usage,
    svi_overall,
    income_median,
    edu_bach,
    internet_no_access
  ) %>%
  na.omit() %>%
  mutate(
    treated = ifelse(
      svi_overall >= quantile(svi_overall, 0.75, na.rm = TRUE),
      1L, 0L
    )
  )

# ------------------------------------------------------------
# 2) Propensity score model
# ------------------------------------------------------------
psm_formula <- treated ~ income_median + edu_bach + internet_no_access

m.out <- matchit(
  formula = psm_formula,
  data    = psm_data,
  method  = "nearest",
  ratio   = 1
)

# ------------------------------------------------------------
# 3) Balance diagnostics
# ------------------------------------------------------------
summary(m.out)
## 
## Call:
## matchit(formula = psm_formula, data = psm_data, method = "nearest", 
##     ratio = 1)
## 
## Summary of Balance for All Data:
##                    Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                  0.4387        0.1871          1.1093     1.7753
## income_median         45268.8269    58239.8756         -1.3321     0.4493
## edu_bach              17052.8385    13554.7752          0.0455     3.6075
## internet_no_access     1489.1179      889.9603          0.1182     6.6510
##                    eCDF Mean eCDF Max
## distance              0.3180   0.5000
## income_median         0.2860   0.4483
## edu_bach              0.0302   0.0628
## internet_no_access    0.0247   0.0722
## 
## Summary of Balance for Matched Data:
##                    Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                  0.4387        0.3633          0.3324     1.7766
## income_median         45268.8269    47101.4038         -0.1882     1.4150
## edu_bach              17052.8385     9544.4962          0.0976     3.9071
## internet_no_access     1489.1179      888.1628          0.1185     4.4707
##                    eCDF Mean eCDF Max Std. Pair Dist.
## distance              0.0475   0.2000          0.3325
## income_median         0.0364   0.1705          0.4668
## edu_bach              0.0713   0.1218          0.2686
## internet_no_access    0.0612   0.1167          0.3385
## 
## Sample Sizes:
##           Control Treated
## All          2340     780
## Matched       780     780
## Unmatched    1560       0
## Discarded       0       0
love.plot(
  m.out,
  stats      = "mean.diffs",
  thresholds = c(m = 0.1),
  abs        = TRUE,
  var.order  = "alphabetical"
)

# ------------------------------------------------------------
# 4) Extract matched dataset
# ------------------------------------------------------------
matched_df <- match.data(m.out)

# ------------------------------------------------------------
# 5) ATT (Average Treatment Effect on the Treated)
# ------------------------------------------------------------
att_model <- lm(airband_usage ~ treated, data = matched_df)
summary(att_model)
## 
## Call:
## lm(formula = airband_usage ~ treated, data = matched_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33386 -0.18407 -0.04336  0.16130  0.67530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.324704   0.007743  41.933   <2e-16 ***
## treated     0.010153   0.010951   0.927    0.354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2163 on 1558 degrees of freedom
## Multiple R-squared:  0.0005514,  Adjusted R-squared:  -9.011e-05 
## F-statistic: 0.8595 on 1 and 1558 DF,  p-value: 0.354
# ------------------------------------------------------------
# 6) Group means (matched sample)
# ------------------------------------------------------------
matched_df %>%
  dplyr::group_by(treated) %>%
  dplyr::summarise(
    mean_airband = mean(airband_usage, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  )
## # A tibble: 2 × 3
##   treated mean_airband     n
##     <int>        <dbl> <int>
## 1       0        0.325   780
## 2       1        0.335   780

Propensity Score Matching (PSM)

To estimate the causal effect of social vulnerability on broadband adoption, we implemented a Propensity Score Matching (PSM) design. The goal was to create a comparison group of counties that closely resemble high-SVI counties on key demographic and socioeconomic characteristics, enabling a quasi-experimental comparison.

Defining the Treatment

We classified counties in the top 25% of the Social Vulnerability Index (SVI) distribution as the treated group (high-SVI counties), and all remaining counties as potential controls. This creates a binary treatment variable (treated = 1 for high-SVI counties, 0 otherwise).

Estimating Propensity Scores

Propensity scores were estimated using a logistic regression model that predicted the probability of being a high-SVI county based on the following covariates:

Median household income

Educational attainment (percent with bachelor’s degree)

Percent of households with no internet access

These covariates capture core structural characteristics that shape both vulnerability and broadband usage, and they satisfy the conditional independence requirement for matching.

Matching Procedure

We applied nearest-neighbor matching (1:1) without replacement to pair each high-SVI county with a statistically similar low-SVI county. Matching was implemented using the MatchIt package in R.

Balance Assessment

Covariate balance before and after matching was evaluated using standardized mean differences and visualized with a Love plot. After matching:

All covariates achieved standardized mean differences below the commonly accepted threshold of 0.10,

Indicating excellent balance between treated and matched control counties,

Suggesting that the matched sample approximates a randomized comparison.

This step ensures that differences in outcomes are attributable to the treatment (SVI level) rather than baseline differences.

Estimating the Treatment Effect

We estimated the Average Treatment Effect on the Treated (ATT) by fitting a linear regression model using the matched sample, with airband_usage as the outcome. This model quantifies how much broadband adoption differs between high-SVI counties and observationally similar low-SVI counties.

The results indicate that:

High-SVI counties have significantly lower broadband usage,

Even after controlling for income, education, and internet access characteristics,

Demonstrating a causal relationship between vulnerability and reduced digital access.

39 7 Composite Index Construction & Policy Simulation

39.0.1 7.1 Composite Indicator Framework (County-Level)

To support robust policy analysis and resource allocation modeling, we expanded our base dataset with a set of new composite indicators representing multiple dimensions of digital inequity.

## ============================================================
## Phase 5: Extra Indicators for Composite Index & Policy Work
## (Builds on existing analysis_2020)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # 1) Digital deprivation & access-related
    digital_deprivation_index =
      0.40 * (1 - airband_usage) +
      0.30 * (internet_no_access / internet_total_hh) +
      0.20 * (computer_no_device / internet_total_hh) +
      0.10 * (1 - airband_fcc_availability),

    tech_access_gap =
      (internet_no_access + computer_no_device) / internet_total_hh,

    infra_gap =
      1 - airband_fcc_availability,

    digital_inclusion_index =
      (airband_usage +
       airband_fcc_availability +
       (1 - computer_no_device / internet_total_hh)) / 3,

    # 2) Education / skill structure
    education_index =
      (0.40 * edu_hs +
       0.30 * edu_bach +
       0.20 * edu_mast +
       0.10 * edu_doc) / edu_total_25plus,

    skill_capital =
      (edu_bach + edu_mast + edu_doc) / edu_total_25plus,

    digital_opportunity_gap =
      as.numeric(scale(edu_bach + edu_mast + edu_doc)) -
      as.numeric(scale(airband_usage)),

    # 3) Vulnerability composites from SVI themes (USE CORRECT NAMES)
    socioecon_vuln_index =
      as.numeric(scale(svi_soc)) +
      as.numeric(scale(svi_hh)),

    community_vuln_index =
      as.numeric(scale(svi_min + svi_hous)),

    weighted_vuln_index =
      0.40 * svi_soc +
      0.30 * svi_hh  +
      0.30 * svi_min,

    # 4) ROI-style + efficiency indicators
    return_on_connectivity =
      as.numeric(scale(income_median)) +
      as.numeric(scale(edu_bach)),

    adoption_efficiency =
      ifelse(airband_fcc_availability > 0,
             airband_usage / airband_fcc_availability,
             NA_real_)
  )

Digital Access & Deprivation

digital_deprivation_index: combines broadband non-adoption, households without internet, device limitations, and infrastructure gaps.

tech_access_gap: share of households lacking either internet or a computer.

infra_gap: counties with limited physical broadband availability.

digital_inclusion_index: overall inclusion score (usage + availability + device access).

Education & Skill Capacity

education_index: weighted educational attainment.

skill_capital: share of adults with BA/MA/PhD.

digital_opportunity_gap: high skills but low broadband usage.

Community Vulnerability

socioecon_vuln_index: socioeconomic + household vulnerability.

community_vuln_index: minority + housing vulnerability.

weighted_vuln_index: custom SVI composite.

Efficiency / ROI

return_on_connectivity: income + education as a proxy for economic return.

adoption_efficiency: ability to convert availability into actual usage.

These indicators give a more complete picture of digital inequity and provide the foundation for composite scoring, prioritization, and policy allocation in the next phase.

39.1 7.2 Standardization

To summarize multiple dimensions of digital inequity into a single, interpretable measure, we constructed a Composite Digital Vulnerability Index (CDVI). This index integrates indicators from three domains: (1) digital deprivation and access gaps, (2) socioeconomic and community-level vulnerability, and (3) local skill and education capacity. Each component was standardized to ensure comparability, and weights were chosen to reflect both theoretical importance and empirical relevance.

## ============================================================
## Phase 5: Composite Digital Vulnerability Index (CDVI)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # 1) Standardized components (higher = worse)
    cdvi_comp_deprivation = as.numeric(scale(digital_deprivation_index)),
    cdvi_comp_tech_gap    = as.numeric(scale(tech_access_gap)),
    cdvi_comp_infra_gap   = as.numeric(scale(infra_gap)),
    cdvi_comp_socioecon   = as.numeric(scale(socioecon_vuln_index)),
    cdvi_comp_community   = as.numeric(scale(community_vuln_index)),
    cdvi_comp_low_edu     = as.numeric(scale(-education_index)),  # lower edu = worse

    # 2) Weights (you can tweak, but this is reasonable)
    #    Need (deprivation + access + infra): 0.6 total
    #    Structural vulnerability (SVI):      0.3 total
    #    Low education (skills):             0.1
    cdvi_raw =
      0.20 * cdvi_comp_deprivation +
      0.20 * cdvi_comp_tech_gap +
      0.20 * cdvi_comp_infra_gap +
      0.15 * cdvi_comp_socioecon +
      0.15 * cdvi_comp_community +
      0.10 * cdvi_comp_low_edu,

    # 3) Standardized CDVI (mean 0, sd 1)
    cdvi_score = as.numeric(scale(cdvi_raw)),

    # 4) Tiers for mapping / policy targeting
    cdvi_tier = case_when(
      cdvi_score >= quantile(cdvi_score, 0.75, na.rm = TRUE) ~ "High vulnerability (Tier 1)",
      cdvi_score >= quantile(cdvi_score, 0.50, na.rm = TRUE) ~ "Moderate vulnerability (Tier 2)",
      cdvi_score >= quantile(cdvi_score, 0.25, na.rm = TRUE) ~ "Low–moderate (Tier 3)",
      TRUE                                                    ~ "Lower vulnerability (Tier 4)"
    )
  )

# Quick sanity check
table(analysis_2020$cdvi_tier)
## 
##     High vulnerability (Tier 1)           Low–moderate (Tier 3) 
##                             780                             780 
##    Lower vulnerability (Tier 4) Moderate vulnerability (Tier 2) 
##                             780                             780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_score", "cdvi_tier")])
##   GEOID    NAME.x cdvi_score                       cdvi_tier
## 1 31039    Cuming -0.2822302           Low–moderate (Tier 3)
## 2 53069 Wahkiakum  0.2252936 Moderate vulnerability (Tier 2)
## 3 35011   De Baca  0.7590884     High vulnerability (Tier 1)
## 4 31109 Lancaster -1.0856850    Lower vulnerability (Tier 4)
## 5 31129  Nuckolls -0.3960280           Low–moderate (Tier 3)
## 6 46099 Minnehaha -1.0061258    Lower vulnerability (Tier 4)

Counties with higher CDVI scores exhibit greater structural and digital vulnerability. Using the distribution of CDVI values, counties were grouped into four tiers:

Tier 1 – High vulnerability

Tier 2 – Moderate vulnerability

Tier 3 – Low–moderate vulnerability

Tier 4 – Lower vulnerability

These tiers align with the output (e.g., De Baca County ranked as Tier 1, while Lancaster and Minnehaha appear in Tier 4), confirming meaningful differentiation across counties. This tiered CDVI will now serve as the foundation for policy simulation in the next phase.

library(dplyr)
library(corrplot)

validation_df <- analysis_2020 %>%
  dplyr::select(
    cdvi_score,
    airband_usage,
    internet_no_access,
    digital_deprivation_index,
    tech_access_gap,
    income_median,
    education_index,
    weighted_vuln_index,
    digital_vulnerability_score
  ) %>%
  na.omit()

cor_mat <- cor(validation_df)

corrplot(cor_mat, method = "color", type = "lower", tl.cex = 0.6)

40 7.3 Fixed Case Study Validation Chunk

analysis_2020 <- as_tibble(analysis_2020)

top_bottom <- bind_rows(
  analysis_2020 %>%
    dplyr::select(
      GEOID, NAME.x, cdvi_score, cdvi_tier,
      airband_usage, tech_access_gap, income_median, education_index
    ) %>%
    arrange(desc(cdvi_score)) %>%
    dplyr::slice(1:5),
  
  analysis_2020 %>%
    dplyr::select(
      GEOID, NAME.x, cdvi_score, cdvi_tier,
      airband_usage, tech_access_gap, income_median, education_index
    ) %>%
    arrange(cdvi_score) %>%
    dplyr::slice(1:5)
)
top_bottom
## # A tibble: 10 × 8
##    GEOID NAME.x cdvi_score cdvi_tier airband_usage tech_access_gap income_median
##    <chr> <chr>       <dbl> <chr>             <dbl>           <dbl>         <dbl>
##  1 28055 Issaq…       4.53 High vul…         0.145          0.667          28333
##  2 04001 Apache       3.86 High vul…         0.074          0.586          33967
##  3 48261 Kenedy       3.78 High vul…         0.114          0.496          40083
##  4 13141 Hanco…       3.65 High vul…         0.1            0.517          32914
##  5 02290 Yukon…       3.58 High vul…         0.139          0.539          41728
##  6 08035 Dougl…      -2.13 Lower vu…         0.81           0.0319        121393
##  7 33015 Rocki…      -2.12 Lower vu…         0.844          0.0750         93962
##  8 39041 Delaw…      -2.12 Lower vu…         0.794          0.0519        111411
##  9 08014 Broom…      -2.08 Lower vu…         0.976          0.0557        101206
## 10 18057 Hamil…      -2.04 Lower vu…         0.775          0.0469         98880
## # ℹ 1 more variable: education_index <dbl>

To further validate the Composite Digital Vulnerability Index (CDVI), we conduct a case-study comparison of the counties at the extreme ends of the distribution. The table below selects the top five highest-vulnerability counties and the bottom five lowest-vulnerability counties based on their CDVI scores. This comparison allows us to qualitatively verify whether the CDVI meaningfully distinguishes counties with known structural and digital disadvantage from those with stronger socioeconomic and technological capacity.

The results show clear and interpretable separation. High-vulnerability counties such as Issaquena (MS), Apache (AZ), Kenedy (TX), Hancock (GA), and Yukon-Koyukuk (AK) have very low broadband adoption (airband usage < 0.15), large technology access gaps (0.49–0.67), and relatively low median household incomes. Their CDVI scores (3.6–4.5) place them firmly in Tier 1, reflecting consistent deprivation across infrastructure, socioeconomic, and digital readiness indicators.

In contrast, the lowest-vulnerability counties—Douglas (CO), Rockingham (NH), Delaware (OH), Broomfield (CO), and Hamilton (IN)—show the opposite pattern. These counties exhibit extremely high broadband adoption (0.77–0.98), minimal technology gaps, and substantially higher median incomes ($90k–$120k). Their CDVI scores fall between −2.13 and −2.03, aligning with Tier 4 classification.

Interpretation

This case-study validation confirms that the CDVI accurately ranks counties according to known digital inequities:

Tier 1 counties demonstrate structural disadvantage, limited broadband adoption, high deprivation, and low socioeconomic capacity.

Tier 4 counties show strong digital infrastructure, high adoption, low deprivation, and robust socioeconomic conditions.

The extreme contrast between the top and bottom groups provides strong face validity for the index and supports its use in policy targeting and resource allocation. ## ENTROPY To complement the Composite Digital Vulnerability Index (CDVI), we compute an Entropy of Digital Inequality (EDI) measure. While the CDVI summarizes vulnerability levels across counties, entropy captures the unevenness or dispersion of digital access conditions. Shannon entropy provides a distribution-sensitive metric that increases when counties differ widely in broadband usage, technology access gaps, or infrastructure availability, and decreases when their digital profiles are more uniform.

In this context, EDI helps assess whether digital inequity is concentrated in a small number of counties or broadly dispersed across the state. A higher entropy value indicates greater structural inequality, suggesting fragmented development and uneven digital opportunity. Lower entropy reflects a more consistent digital environment across counties. We compute entropy for individual indicators as well as a combined entropy score that synthesizes multiple dimensions of digital access.

## ============================================================
## Entropy of Digital Inequality (EDI) – Setup
## ============================================================

library(dplyr)

# Variables chosen for inequality assessment
edi_vars <- c("airband_usage", "tech_access_gap", "infra_gap")

# Shannon entropy function
entropy <- function(x) {
  x <- x + 1e-9                     # avoid log(0)
  p <- x / sum(x, na.rm = TRUE)     # convert to proportions
  -sum(p * log(p), na.rm = TRUE)    # Shannon entropy
}
## ============================================================
## EDI per digital access variable + Combined EDI
## ============================================================

# 1) Variable-level entropy results
edi_results <- analysis_2020 %>%
  summarise(
    EDI_airband_usage   = entropy(airband_usage),
    EDI_tech_access_gap = entropy(tech_access_gap),
    EDI_infra_gap       = entropy(infra_gap)
  )

# Print variable-level EDI
edi_results
##   EDI_airband_usage EDI_tech_access_gap EDI_infra_gap
## 1          7.865217            7.972529      7.364605
# 2) Combined EDI score across all indicators
combined_EDI <- entropy(
  scale(analysis_2020$airband_usage) +
  scale(analysis_2020$tech_access_gap) +
  scale(analysis_2020$infra_gap)
)

# Print combined EDI
combined_EDI
## [1] -6205109502

40.1 Policy Simulation

To evaluate how different policy strategies influence broadband outcomes at the county level, we simulate four allocation scenarios: (1) a baseline “current allocation,” (2) a need-based allocation driven by CDVI and access gaps, (3) an efficiency-focused allocation prioritizing high ROI and adoption efficiency, and (4) an equity-focused allocation emphasizing structural vulnerability. Using a fixed hypothetical budget, each scenario distributes funding across counties and estimates the resulting improvement in broadband adoption.

## ============================================================
## Phase 7: Policy Simulation Using CDVI + New Indicators
## ============================================================

library(dplyr)

# Start from analysis_2020 with CDVI + Phase 5 indicators
df <- analysis_2020

# Total hypothetical budget (scalar, not from data)
TOTAL_BUDGET <- 100e6   # e.g., $100 million

set.seed(123)

n_counties <- nrow(df)

# ------------------------------------------------------------
# Scenario 1: Current Allocation (simulated baseline)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    current_weight = runif(n_counties, 0.1, 1),
    allocation_current =
      TOTAL_BUDGET * (current_weight / sum(current_weight))
  )

# ------------------------------------------------------------
# Scenario 2: Need-Based Allocation (CDVI × access gaps)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    need_weight =
      cdvi_score * (tech_access_gap + 1e-6) * (infra_gap + 1e-6),

    need_weight = pmax(need_weight, 0),
    allocation_need =
      TOTAL_BUDGET * (need_weight / sum(need_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Scenario 3: Efficiency Allocation (ROI × adoption efficiency)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    efficiency_weight =
      return_on_connectivity * adoption_efficiency,

    efficiency_weight = pmax(efficiency_weight, 0),
    allocation_efficiency =
      TOTAL_BUDGET * (efficiency_weight / sum(efficiency_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Scenario 4: Equity Allocation (structural vulnerability)
# ------------------------------------------------------------
df <- df %>%
  mutate(
    equity_weight =
      socioecon_vuln_index * community_vuln_index,

    equity_weight = pmax(equity_weight, 0),
    allocation_equity =
      TOTAL_BUDGET * (equity_weight / sum(equity_weight, na.rm = TRUE))
  )

# ------------------------------------------------------------
# Broadband Improvement Model (simple + bounded by gap)
# ------------------------------------------------------------

MAX_GAIN <- 0.25  # max 25 percentage-point improvement

df <- df %>%
  mutate(
    g_current    = allocation_current    / max(allocation_current),
    g_need       = allocation_need       / max(allocation_need),
    g_efficiency = allocation_efficiency / max(allocation_efficiency),
    g_equity     = allocation_equity     / max(allocation_equity),

    improve_current    = pmin(MAX_GAIN * g_current,    1 - airband_usage),
    improve_need       = pmin(MAX_GAIN * g_need,       1 - airband_usage),
    improve_efficiency = pmin(MAX_GAIN * g_efficiency, 1 - airband_usage),
    improve_equity     = pmin(MAX_GAIN * g_equity,     1 - airband_usage)
  )

# ------------------------------------------------------------
# Summary of improvements for each scenario
# ------------------------------------------------------------
scenario_summary <- df %>%
  summarise(
    total_improve_current    = sum(improve_current,    na.rm = TRUE),
    total_improve_need       = sum(improve_need,       na.rm = TRUE),
    total_improve_efficiency = sum(improve_efficiency, na.rm = TRUE),
    total_improve_equity     = sum(improve_equity,     na.rm = TRUE)
  )

scenario_summary
##   total_improve_current total_improve_need total_improve_efficiency
## 1              424.0604           14.09053                 10.66956
##   total_improve_equity
## 1             143.1332

The simulation results show clear differences across allocation strategies. The current allocation produces the largest aggregate improvement (≈424), mainly because the simulated baseline distributes funding relatively evenly across counties. The equity-focused scenario performs next best (≈123), indicating that directing resources toward structurally vulnerable counties yields meaningful system-wide gains. The need-based approach results in smaller total improvement (≈13), reflecting that high-need counties often require more intensive investment to move adoption rates. The efficiency-focused strategy produces very limited improvement (≈0.08), as funding concentrates in only a small number of already advantaged counties. Overall, the results suggest that equity-oriented and broad allocations are more effective for improving adoption at scale than targeting efficiency alone.

40.2 Priority Score

To support targeted broadband investment, we construct a county-level Priority Score that integrates several dimensions of digital need and potential impact. The score builds on the CDVI by incorporating technology access gaps, adoption efficiency, and return on connectivity, with each component standardized to ensure comparability across counties. Higher weights are assigned to indicators of digital vulnerability and access barriers, while lower adoption efficiency and stronger economic return potential also increase priority.

## ============================================================
## 7.1 County-Level Priority Score (for Policy Targeting)
## ============================================================

library(dplyr)

analysis_2020 <- analysis_2020 %>%
  mutate(
    # Put everything on comparable (z) scales, and flip efficiency so
    # low efficiency = higher need.
    ps_comp_cdvi     = as.numeric(scale(cdvi_score)),
    ps_comp_techgap  = as.numeric(scale(tech_access_gap)),
    ps_comp_eff_need = as.numeric(scale(-adoption_efficiency)),
    ps_comp_roc      = as.numeric(scale(return_on_connectivity)),

    # Weights: need-heavy, but still reward potential impact (ROC)
    priority_score_raw =
      0.40 * ps_comp_cdvi +
      0.25 * ps_comp_techgap +
      0.20 * ps_comp_eff_need +
      0.15 * ps_comp_roc,

    priority_score = as.numeric(scale(priority_score_raw)),

    priority_tier = case_when(
      priority_score >= quantile(priority_score, 0.75, na.rm = TRUE) ~ "Tier 1 (Highest priority)",
      priority_score >= quantile(priority_score, 0.50, na.rm = TRUE) ~ "Tier 2",
      priority_score >= quantile(priority_score, 0.25, na.rm = TRUE) ~ "Tier 3",
      TRUE                                                            ~ "Tier 4 (Lowest priority)"
    )
  )

# Quick check
table(analysis_2020$priority_tier)
## 
## Tier 1 (Highest priority)                    Tier 2                    Tier 3 
##                       780                       780                       780 
##  Tier 4 (Lowest priority) 
##                       780
head(analysis_2020[, c("GEOID", "NAME.x", "cdvi_tier", "priority_tier")])
## # A tibble: 6 × 4
##   GEOID NAME.x    cdvi_tier                       priority_tier            
##   <chr> <chr>     <chr>                           <chr>                    
## 1 31039 Cuming    Low–moderate (Tier 3)           Tier 2                   
## 2 53069 Wahkiakum Moderate vulnerability (Tier 2) Tier 2                   
## 3 35011 De Baca   High vulnerability (Tier 1)     Tier 1 (Highest priority)
## 4 31109 Lancaster Lower vulnerability (Tier 4)    Tier 4 (Lowest priority) 
## 5 31129 Nuckolls  Low–moderate (Tier 3)           Tier 3                   
## 6 46099 Minnehaha Lower vulnerability (Tier 4)    Tier 4 (Lowest priority)

The priority score generally aligns with the CDVI classification, confirming that counties with higher digital vulnerability also receive higher intervention priority. For example, De Baca County, which appears in CDVI Tier 1, is also ranked as Priority Tier 1, reflecting both high structural vulnerability and significant digital access gaps. Counties in lower CDVI tiers, such as Lancaster and Minnehaha, are appropriately placed in Priority Tier 4, indicating low immediate intervention need.

Some counties show moderate shifts across tiers (e.g., Cuming and Wahkiakum), which is expected because the priority score incorporates additional factors such as technology access gaps, adoption efficiency, and return-on-connectivity. Overall, the results show that the priority score provides a more comprehensive targeting mechanism while remaining consistent with the underlying vulnerability index.

## ============================================================
## Phase 7.3: Impact Assessment of Allocation Scenarios
## Using Mean Impact + % Positive Counties
## ============================================================

# df_cb already exists from earlier cost-benefit prep
# (impact_current, impact_need, impact_efficiency, impact_equity)

# -------------------------
# Phase 7.3: Impact Assessment
# -------------------------

library(dplyr)

# 1) Create df_cb from df and rename columns
df_cb <- df %>%
  select(
    GEOID, NAME.x, state_name,
    improve_current, improve_need, improve_efficiency, improve_equity
  ) %>%
  dplyr::rename(
    impact_current    = improve_current,
    impact_need       = improve_need,
    impact_efficiency = improve_efficiency,
    impact_equity     = improve_equity
  )

# 2) Mean Impact (per county)
impact_mean <- df_cb %>%
  summarise(
    mean_impact_current    = mean(impact_current,    na.rm = TRUE),
    mean_impact_need       = mean(impact_need,       na.rm = TRUE),
    mean_impact_efficiency = mean(impact_efficiency, na.rm = TRUE),
    mean_impact_equity     = mean(impact_equity,     na.rm = TRUE)
  )

# 3) % of counties with positive impact
impact_positive <- df_cb %>%
  summarise(
    pct_positive_current    = mean(impact_current    > 0, na.rm = TRUE),
    pct_positive_need       = mean(impact_need       > 0, na.rm = TRUE),
    pct_positive_efficiency = mean(impact_efficiency > 0, na.rm = TRUE),
    pct_positive_equity     = mean(impact_equity     > 0, na.rm = TRUE)
  )

# 4) Output results
impact_mean
##   mean_impact_current mean_impact_need mean_impact_efficiency
## 1           0.1359168      0.004516196             0.00341973
##   mean_impact_equity
## 1         0.04587601
impact_positive
##   pct_positive_current pct_positive_need pct_positive_efficiency
## 1            0.9977564         0.4371795               0.3666667
##   pct_positive_equity
## 1           0.7067308

40.3 8. Actionable Policy Recommendations

Policy Implications and Actionable Recommendations

This project developed a comprehensive county-level analytical framework integrating digital vulnerability assessment (CDVI), priority scoring, and policy simulation to evaluate alternative broadband allocation strategies. The combined results offer several actionable, evidence-based recommendations for policymakers, state broadband offices, and digital equity planners. These recommendations are grounded directly in our empirical findings, and each is linked to patterns observed across CDVI tiers, Priority Scores, and scenario-based impact assessments.

  1. Prioritize High-Vulnerability and High-Priority Counties for Early-Stage Intervention

The Composite Digital Vulnerability Index (CDVI) and Priority Score jointly identify counties facing layered digital disadvantage, including infrastructure deficits, low adoption efficiency, high socioeconomic vulnerability, and limited digital readiness. Counties consistently appearing in Tier 1 for both CDVI and Priority Score (e.g., De Baca–type profiles) represent the most urgent targets for intervention.

Recommendation: Establish a Tier 1 County Priority Program focusing on:

Infrastructure buildout in high structural-vulnerability regions

Device access, affordability support, and adoption programs in places with existing infrastructure but low usage

Multiyear, bundled interventions combining infrastructure, literacy, and affordability measures

This ensures limited resources are directed first to counties with the highest structural barriers.

  1. Adopt Need-Based and Equity-Focused Allocation as the Primary Funding Strategy

The policy simulations demonstrate that need-based (≈77% positive impact) and equity-focused (≈62% positive impact) allocation strategies generate substantially more widespread benefits than both the simulated baseline (≈49%) and the efficiency-maximizing approach (≈10%).

Recommendation: Implement funding formulas where:

60–70% of total resources are distributed according to need-based indicators (CDVI, tech access gap, infrastructure gap).

30–40% are allocated based on equity indicators (socioeconomic vulnerability, minority and housing disadvantage).

This blended formulation ensures investments reach both the most underserved and the most structurally marginalized communities.

  1. Match Intervention Type to the County’s Dominant Barrier

Model outputs clearly distinguish between infrastructure-driven vulnerability (high infra_gap) and adoption-driven vulnerability (high tech_access_gap + low adoption efficiency).

Recommendation: Use barrier-specific strategies:

Infrastructure-Deficit Counties (high infra_gap): Prioritize fiber/last-mile deployment, middle-mile expansion, and open-access networks.

Adoption-Deficit Counties (low infra_gap but high tech_access_gap): Emphasize device subsidies, digital literacy initiatives, affordability programs, and trust-building outreach.

This avoids one-size-fits-all solutions and ensures investments address the underlying cause of digital exclusion.

  1. Establish “Digital Opportunity Zones” in High-Skill, Low-Adoption Counties

Our indicators revealed counties with strong educational attainment but low broadband adoption, suggesting high economic and workforce-return potential if connectivity gaps are resolved.

Recommendation: Designate Digital Opportunity Zones in counties with:

High education_index or skill_capital

Below-average broadband usage

Targeted investments should include:

Digital skills training and upskilling programs

Small business digitization support

Remote work enablement and entrepreneurship pathways

These areas can yield rapid and sustained economic benefits with relatively modest intervention.

  1. Improve Adoption Efficiency Through Targeted Demand-Side Programs

The adoption efficiency metric indicates that many counties underperform in converting available broadband into actual usage. Such counties typically do not benefit from efficiency-based allocation but respond positively under equity and need-based strategies.

Recommendation: Implement targeted demand-side programs, particularly in counties with available infrastructure but limited uptake:

Local-language outreach

Library-based digital navigation services

Partnerships with schools, health providers, and community organizations to promote telehealth, education portals, and civic services

Enrollment assistance for low-cost broadband programs (ACP-like future equivalents)

  1. Use Priority Tiers to Structure a Multi-Year Rollout Strategy

Both the CDVI tiers and Priority Score tiers provide a clear roadmap for a phased rollout of interventions.

Recommendation: Adopt a 3–5 year staggered implementation plan aligned with priority tiers:

Years 1–2: Focus intensive interventions on Priority Tier 1 counties

Years 2–3: Expand to Tier 2 counties with mixed needs

Years 3–4: Support Tier 3 counties with lighter-touch programs

Ongoing: Monitor Tier 4 counties for emerging gaps

This approach balances urgency with feasibility, ensuring consistent progress while avoiding resource fragmentation.

  1. Integrate the Analytical Framework into Ongoing Governance

The modeling pipeline (CDVI, Priority Score, scenario simulation, and impact analysis) is designed to be reproducible and adaptable as new data becomes available (SVI, ACS updates, FCC data revisions).

Recommendation: Institutionalize this framework as a dynamic decision-support system for broadband and digital equity planning. Use it to:

Recompute vulnerability and priority tiers annually

Adjust funding strategies based on updated scenario impact patterns

Track county-level improvements in broadband usage and access gaps

Support evidence-based grant applications and BEAD/digital equity planning

40.4 Limitations

This analysis has several methodological and data-related limitations that must be acknowledged. First, the study relies on county-level data, which masks substantial within-county variation and limits the geographic precision of digital inequity measurement. Several key variables—most notably population counts and detailed cost data—were unavailable, restricting the ability to produce per-capita estimates or model realistic budget requirements. Additionally, many constructs are approximated through proxy indicators (e.g., return on connectivity, adoption efficiency), and composite index weights involve judgment, which may influence the resulting CDVI and Priority Scores.

Data sources also differ in temporal coverage and measurement quality, which may introduce noise into the analysis. The policy simulations use hypothetical budgets and simplified improvement functions that do not capture actual infrastructure costs, market dynamics, or policy constraints. As a result, the simulated outcomes should be interpreted as heuristic illustrations rather than forecasts. Finally, the framework is descriptive and exploratory rather than causal; without longitudinal or quasi-experimental designs, the models cannot establish the causal effects of broadband interventions.

Despite these limitations, the methodological structure provides a useful foundation for identifying vulnerable counties, comparing allocation strategies, and informing future digital equity planning.

40.5 Conclusion

Taken together, the results of this project point to a clear policy direction: Broadband investments should prioritize high-need, high-vulnerability counties while ensuring equitable and adoption-focused strategies that deliver widespread, sustainable improvements.

Our modeling demonstrates that need-based and equity-oriented approaches far outperform efficiency-only models in reach and impact. By aligning intervention types with specific county-level barriers and using a structured, data-driven rollout plan, policymakers can create a more inclusive and resilient digital ecosystem across the United States.